Abstract: Empirical Bayes’ (EB) estimation has become a popular procedure used to calculate
teacher value-added, often as a way to make imprecise estimates more reliable. In this paper we
review the theory of EB estimation and use simulated and real student achievement data to study
the ability of EB estimators to properly rank teachers. We compare the performance of EB
estimators with that of other widely used value-added estimators under different teacher assignment
scenarios. We find that, although EB estimators generally perform well under random assignment
of teachers to classrooms, their performance suffers under nonrandom teacher assignment.
Under nonrandom assignment, estimators that explicitly (if imperfectly) control for the teacher
assignment mechanism perform the best out of all the estimators we examine. We also find that
shrinking the estimates, as in EB estimation, does not itself substantially boost performance.



1 Introduction 


Empirical Bayes’ (EB) estimation of teacher effects has recently gained popularity in the
value-added research community (see, for example, McCaffrey et al. 2004; Kane & Staiger 2008; Chetty,
Friedman, & Rockoff forthcoming; Corcoran, Jennings, & Beveridge 2011; and Jacob & Lefgren
2005, 2008). Researchers motivate the use of EB estimation as a way to decrease classification
error of teachers, especially when limited data are available to compute value-added estimates.
Because teacher value-added estimates can be very noisy when there are only a small number of
students per teacher, EB estimates of teacher value-added reduce the variability of the estimates
by shrinking them toward the average estimated teacher effect in the sample. As the degree of
shrinkage depends on class size, estimates for teachers with smaller class sizes are more affected,
potentially reducing the misclassification of these teachers. EB, or “shrinkage,” estimation
may also be less computationally demanding than methods that view the teacher effects as fixed
parameters to estimate. Finally, EB estimation has been motivated as a way to estimate teacher
value-added when including controls for peer effects and other classroom-level covariates.

This paper analyzes the performance of EB estimation using both simulated and real student 
achievement data. We first provide a detailed theoretical derivation of the EB estimator, which 
has not previously been explicitly derived in the context of teacher value-added. This theoretical 
discussion provides the basis for our expectations about how EB and other value-added estimators 
will perform under the different simulation scenarios we examine. We test our theoretical
predictions by comparing the performance of EB estimators to estimators that treat the teacher effect as
fixed. We first use a simulation, where the true teacher effect is known, comparing performance 
under random teacher assignment and various nonrandom assignment scenarios. In addition to the 
random vs. fixed teacher effects comparison, we also examine whether shrinking the estimates 
improves performance. Finally, we apply these estimators to real student achievement data to see 
how the rankings of teachers vary across these estimators in a real-world setting. 
Despite the potential benefits of EB estimation, we find that the estimated teacher effects can
suffer from severe bias under nonrandom teacher assignment. By treating the teacher effects as
random, EB estimation assumes that teacher assignment is uncorrelated with factors that predict
student achievement, including observed factors such as past test scores. While the bias
(technically, the inconsistency) disappears as the number of students per teacher increases, because
the EB estimates converge to the so-called fixed effects estimates, the bias can still be important
with the type of data used to estimate teacher value-added models (VAMs). This is because the EB
estimators of the coefficients on the covariates in the model are inconsistent for fixed class sizes
as the number of classrooms grows. By contrast, estimators that include the teacher assignment
indicators along with the covariates in a multiple regression analysis are consistent (as the number
of classrooms grows) for the coefficients on the covariates. This generally leads to less bias in the
estimated teacher VAMs under nonrandom assignment when there are not many students per teacher.

The paper begins in Section 2 with a detailed theoretical derivation of the EB estimator. Section 
3 follows with a description of the five estimators we examine. Section 4 describes our
simulation design and the different student grouping and teacher assignment scenarios we examine, with
Section 5 providing the results of this analysis. Section 6 provides an analysis of these estimators 
using real student achievement data, and Section 7 concludes. 

2 Empirical Bayes’ Estimation 

There are several ways to derive Empirical Bayes’ estimators of teacher value added. We adopt 
a so-called “mixed estimation” approach, as in Ballou, Sanders, and Wright (2004), because it is 
fairly straightforward and does not require delving into Bayesian estimation methods. Our focus 
is on estimating teacher effects grade by grade. Therefore, we assume either that we have a single 
cross section or multiple cohorts of students for each teacher. We do not include cohort effects; 
multiple cohorts are allowed by pooling students across cohorts for each teacher. 
Let $y_i$ denote a measure of achievement for student $i$ randomly drawn from the population. This
measure could be a test score or a gain score (i.e., current minus lagged score). Suppose there are $G$
teachers and the teacher effects are $b_g$, $g = 1, \ldots, G$. In the mixed effects setting, the $b_g$ are treated
as random variables drawn from a population of teacher effects, as opposed to fixed population
parameters. Viewing the $b_g$ as random variables independent of other observable factors affecting
test scores has consequences for the properties of EB estimators.

Typically VAMs are estimated while controlling for other factors, which we denote by a row
vector $x_i$. These factors include prior test scores and, in some cases, student-level and/or
classroom-level covariates. We treat the coefficients on these covariates as fixed population
parameters. We can write a mixed effects linear model as

$$y_i = x_i \gamma + z_i b + u_i, \tag{1}$$

where $z_i$ is a $1 \times G$ row vector of teacher assignment dummies, $b$ is the $G \times 1$ vector of teacher
effects, and $u_i$ contains the unobserved student-specific effects. Because a student is assigned to
one and only one teacher, $z_{i1} + z_{i2} + \cdots + z_{iG} = 1$. Equation (1) is an example of a “mixed model”
because it includes the usual fixed population parameters $\gamma$ and the random coefficients $b$. Even if
there are no covariates, $x_i$ typically includes an intercept. If $x_i$ is only a constant, so that
$x_i \gamma = \gamma$, then $\gamma$ is the average teacher effect and we can then assume $E(b) = 0$. This means that
$b_g$ is the effect of teacher $g$ net of the overall mean teacher effect.

Equation (1) is written for a particular student $i$ so that teacher assignment is determined by the
vector $z_i$. A standard assumption is that, conditional on $b$ (that is, for a given set of teachers
available for assignment), (1) represents a linear conditional mean:

$$E(y_i \mid x_i, z_i, b) = x_i \gamma + z_i b, \tag{2}$$

which follows from equation (1) and

$$E(u_i \mid x_i, z_i, b) = 0. \tag{3}$$

An important implication of (3) is that $u_i$ is necessarily uncorrelated with $z_i$, so that teacher
assignment is not systematically related to unobserved student characteristics once we have controlled
for the observed factors in $x_i$.

If we assume a sample of $N$ students, each assigned to one of $G$ teachers, we can write (1) in matrix
notation as

$$y = X\gamma + Zb + u, \tag{4}$$

where $y$ and $u$ are $N \times 1$, $X$ is $N \times K$, and $Z$ is $N \times G$. In order to obtain the best linear unbiased
estimator (BLUE) of $\gamma$ and the best linear unbiased predictor (BLUP) of $b$, we assume that the
covariates and teacher assignments satisfy a strict exogeneity assumption:

$$E(u_i \mid X, Z, b) = 0, \quad i = 1, \ldots, N. \tag{5}$$

An implication of assumption (5) is that inputs and teacher assignment of other students do not
affect the outcome of student $i$.

Given assumption (5) we can write the conditional expectation of $y$ as

$$E(y \mid X, Z, b) = X\gamma + Zb. \tag{6}$$

In the EB literature a standard assumption is

$$b \text{ is independent of } (X, Z), \tag{7}$$

in which case

$$E(y \mid X, Z) = X\gamma + Z E(b \mid X, Z) = X\gamma = E(y \mid X) \tag{8}$$

because $E(b \mid X, Z) = E(b) = 0$. Assumption (7) has the implication that teacher assignment for
student $i$ does not depend on the quality of the teacher (as measured by the $b_g$).

From an econometric perspective, the statement that $E(y \mid X) = X\gamma$ means that $\gamma$ can be
estimated in an unbiased way by an OLS regression of

$$y_i \text{ on } x_i, \quad i = 1, \ldots, N. \tag{9}$$

Consequently, we can estimate the effects of the covariates $x_i$ using a regression that completely
ignores teacher assignment. As a practical matter, this has a very important implication when
viewed from a classical, fixed parameters model: we are assuming that teacher assignment is
uncorrelated with the covariates $x_i$. Correlation between teacher assignment and covariates in $x_i$ is
a potential source of bias in EB (and related) estimators of the teacher effects.

Under (5) and (7), the OLS estimator of $\gamma$ is unbiased and consistent, but it is inefficient if we
impose the standard classical linear model assumptions on $u$. In particular, if the error variance
has the usual scalar structure,

$$\mathrm{Var}(u \mid X, Z, b) = \mathrm{Var}(u) = \sigma_u^2 I_N, \tag{10}$$

then

$$\mathrm{Var}(y \mid X, Z) = E[(Zb + u)(Zb + u)' \mid X, Z] = Z\,\mathrm{Var}(b)\,Z' + \mathrm{Var}(u) = \sigma_b^2 ZZ' + \sigma_u^2 I_N,$$

where we also add the standard assumption that the elements of $b$ are uncorrelated,

$$\mathrm{Var}(b) = \sigma_b^2 I_G, \tag{11}$$

and $\sigma_b^2$ is the variance of the teacher effects, $b_g$.

Under the assumption that $\sigma_b^2$ and $\sigma_u^2$ (or at least their ratio) are known, the BLUE of $\gamma$ under
the preceding assumptions is the generalized least squares (GLS) estimator,

$$\gamma^* = [X'(\sigma_b^2 ZZ' + \sigma_u^2 I_N)^{-1} X]^{-1} X'(\sigma_b^2 ZZ' + \sigma_u^2 I_N)^{-1} y. \tag{12}$$

The $N \times N$ matrix $ZZ'$ is a block diagonal matrix with $G$ blocks, where block $g$ is an $N_g \times N_g$
matrix of ones and $N_g$ is the number of students taught by teacher $g$. The GLS estimator $\gamma^*$ is the
well-known “random effects” (RE) estimator popular from panel data and cluster sample analysis.
However, it is important to understand that the “random effects” in this case are teacher effects,
not student-specific effects. Also, like the OLS estimator from equation (9), the GLS estimator $\gamma^*$
does not partial out the teacher assignment.
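
The block structure of $ZZ'$ can be checked numerically. The following sketch is purely illustrative: the three-student assignment list and the helper names (`assignment_matrix`, `transpose`, `matmul`) are hypothetical and are not taken from the paper. It builds $Z$ from a list of teacher labels, with students sorted by teacher, and forms both $ZZ'$ and $Z'Z$.

```python
def assignment_matrix(teacher_of_student, teachers):
    """Return the N x G matrix Z of teacher assignment dummies."""
    return [[1 if t == g else 0 for g in teachers] for t in teacher_of_student]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Plain-Python matrix product, adequate for tiny examples."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


# Hypothetical example: students 1 and 2 have teacher "g1", student 3 has "g2".
Z = assignment_matrix(["g1", "g1", "g2"], ["g1", "g2"])
ZZt = matmul(Z, transpose(Z))  # N x N: one block of ones per teacher
ZtZ = matmul(transpose(Z), Z)  # G x G: diagonal, with class sizes N_g
```

Under these assumptions `ZZt` has a 2 x 2 block of ones for the first teacher and a 1 x 1 block for the second, while `ZtZ` is diagonal with the class sizes 2 and 1.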

Before we discuss $\gamma^*$ further, it is helpful to write down the mixed effects model in perhaps a
more common form. After students have been designated to classrooms, we can write $y_{gi}$ as the
outcome for student $i$ in class $g$, and similarly for $x_{gi}$ and $u_{gi}$. Then, for classroom $g$, we have

$$y_{gi} = x_{gi}\gamma + b_g + u_{gi} \equiv x_{gi}\gamma + r_{gi}, \quad i = 1, \ldots, N_g, \tag{13}$$

where $r_{gi} \equiv b_g + u_{gi}$ is the composite error term. In other words, the variation in $y_{gi}$ not explained
by $x_{gi}$ is due to teacher and individual student effects, and both of these ($b_g$ and $u_{gi}$) are assumed
to be independent of $x_{gi}$. Equation (13) also makes it easy to see that the BLUE of $\gamma$ is the random
effects estimator. Further, the assumption $E(u_{gi} \mid X_g, b_g) = 0$ implies that covariates from student $h$
do not affect the outcome of student $i$. We can also see that OLS pooled across $i$ and $g$ is unbiased
for $\gamma$ because we are assuming $E(b_g \mid X_g) = 0$.

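What about estimation of $b$, the teacher effects? As shown in, say, Ballou, Sanders, and Wright
(2004), the BLUP of $b$ under assumptions (5), (7), and (10) is

$$b^* = (Z'Z + \rho I_G)^{-1} Z'(y - X\gamma^*) = (Z'Z + \rho I_G)^{-1} Z' r^*, \tag{14}$$

where $\rho = \sigma_u^2 / \sigma_b^2$, and $r^* = y - X\gamma^*$ is the vector of residuals. Straightforward matrix algebra
shows each $b_g^*$ can be expressed as

$$b_g^* = (N_g + \rho)^{-1} \sum_{i=1}^{N_g} r_{gi}^* = \left( \frac{N_g}{N_g + \rho} \right) \bar{r}_g^* = \left[ \frac{\sigma_b^2}{\sigma_b^2 + (\sigma_u^2 / N_g)} \right] \bar{r}_g^* = \left[ \frac{\sigma_b^2}{\sigma_b^2 + (\sigma_u^2 / N_g)} \right] (\bar{y}_g - \bar{x}_g \gamma^*), \tag{15}$$

where

$$\bar{r}_g^* = N_g^{-1} \sum_{i=1}^{N_g} r_{gi}^* = \bar{y}_g - \bar{x}_g \gamma^* \tag{16}$$

is the average of the residuals $r_{gi}^* = y_{gi} - x_{gi}\gamma^*$ within classroom $g$.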
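
The step from (14) to (15) uses the fact that $Z'Z$ is diagonal with entries $N_g$, while $Z'r^*$ stacks the within-teacher residual sums. As a minimal sketch, assuming the variances are known and using made-up residuals and hypothetical function names, the two forms of (15) can be computed side by side:

```python
def blup_from_sums(residuals, teacher_of_student, sigma2_b, sigma2_u):
    """b*_g = (N_g + rho)^(-1) * sum_i r*_gi, the first form in equation (15)."""
    rho = sigma2_u / sigma2_b
    sums, counts = {}, {}
    for r, g in zip(residuals, teacher_of_student):
        sums[g] = sums.get(g, 0.0) + r
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / (counts[g] + rho) for g in sums}


def blup_from_means(residuals, teacher_of_student, sigma2_b, sigma2_u):
    """b*_g = shrinkage factor times the mean residual, the last form in (15)."""
    groups = {}
    for r, g in zip(residuals, teacher_of_student):
        groups.setdefault(g, []).append(r)
    out = {}
    for g, rs in groups.items():
        factor = sigma2_b / (sigma2_b + sigma2_u / len(rs))
        out[g] = factor * sum(rs) / len(rs)
    return out


# Made-up residuals: teacher "a" has four students, teacher "b" has one.
resid = [1.0, 1.0, 1.0, 1.0, 1.0]
teach = ["a", "a", "a", "a", "b"]
by_sums = blup_from_sums(resid, teach, sigma2_b=1.0, sigma2_u=4.0)
by_means = blup_from_means(resid, teach, sigma2_b=1.0, sigma2_u=4.0)
```

Both teachers have an average residual of 1, but with $\rho = 4$ the teacher with four students is shrunk to 0.5 and the teacher with one student to 0.2, which illustrates how the smaller class is pulled further toward zero.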

To operationalize $\gamma^*$ and $b^*$, we must replace $\sigma_b^2$ and $\sigma_u^2$ with estimates. There are different
ways to obtain estimates depending on whether one uses OLS residuals after an initial estimation
or a joint estimation method. With the composite error defined as $r_{gi} = b_g + u_{gi}$, we can write
$\sigma_r^2 = \sigma_b^2 + \sigma_u^2$. An estimator of $\sigma_r^2$ can be obtained from the usual sum of squared residuals from
the OLS regression

$$y_{gi} \text{ on } x_{gi}, \quad i = 1, \ldots, N_g, \; g = 1, \ldots, G. \tag{17}$$

Call the residuals $\tilde{r}_{gi}$. Then a consistent estimator is

$$\hat{\sigma}_r^2 = (N - K)^{-1} \sum_{g=1}^{G} \sum_{i=1}^{N_g} \tilde{r}_{gi}^2, \tag{18}$$

which is just the usual degrees-of-freedom (df) adjusted error variance estimator from OLS.

To estimate $\sigma_u^2$, write $r_{gi} - \bar{r}_g = u_{gi} - \bar{u}_g$, where $\bar{r}_g$ is the within-teacher average, and
similarly for $\bar{u}_g$. A standard result on demeaning a set of uncorrelated random variables with the
same variance gives $\mathrm{Var}(u_{gi} - \bar{u}_g) = \sigma_u^2 (1 - N_g^{-1})$ and so, for each $g$,
$E\left[ \sum_{i=1}^{N_g} (r_{gi} - \bar{r}_g)^2 \right] = \sigma_u^2 (N_g - 1)$. When we sum across teachers it follows that

$$(N - G)^{-1} \sum_{g=1}^{G} \sum_{i=1}^{N_g} (r_{gi} - \bar{r}_g)^2 \tag{19}$$

has expected value $\sigma_u^2$. To turn (19) into an estimator we can replace $r_{gi}$ with the OLS residuals,
$\tilde{r}_{gi}$, from the regression in (17). The estimator based on the OLS residuals is

$$\hat{\sigma}_u^2 = (N - G)^{-1} \sum_{g=1}^{G} \sum_{i=1}^{N_g} (\tilde{r}_{gi} - \bar{\tilde{r}}_g)^2. \tag{20}$$

With fixed class sizes and $G$ getting large, the estimator that uses $N$ in place of $N - G$ is not
consistent. Therefore, we prefer the estimator in equation (20), as it should have less bias in
applications where $G/N$ is not small. With many students per teacher the difference should be
minor. We could also use $N - G - K$ as a further df adjustment, but subtracting off $K$ does not
affect the consistency.

Given $\hat{\sigma}_r^2$ and $\hat{\sigma}_u^2$, we can estimate $\sigma_b^2$ as

$$\hat{\sigma}_b^2 = \hat{\sigma}_r^2 - \hat{\sigma}_u^2. \tag{21}$$

In any particular data set, especially if the data have been generated to violate the standard
assumptions listed above, there is no guarantee that expression (21) is nonnegative. A simple
solution to this problem (and one used in software packages, such as Stata) is to set $\hat{\sigma}_b^2 = 0$
whenever $\hat{\sigma}_r^2 < \hat{\sigma}_u^2$. In order to ensure this happens infrequently with multiple cohorts, we
compute $\hat{\sigma}_u^2$ by replacing $\bar{\tilde{r}}_g$ with the average obtained for the particular cohort. This ensures
that, for a given cohort, the terms $\sum_{i=1}^{N_g} (\tilde{r}_{gi} - \bar{\tilde{r}}_g)^2$ are as small as possible. In theory, if
there are no cohort effects we could use an overall cohort mean for $\bar{\tilde{r}}_g$. But using cohort-specific
means reduces the problem of negative $\hat{\sigma}_b^2$ when the model is misspecified.
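
The residual-based variance estimators can be sketched in a few lines. This is a simplified illustration rather than the authors' code: the function name and the toy residuals are hypothetical, the residuals are taken as given instead of being produced by an OLS fit, a single cohort is assumed, and the divisor $N - K$ for the composite variance is an assumption consistent with the "usual df-adjusted" description.

```python
def variance_components(ols_resid, teacher_of_student, n_covariates):
    """Return (sigma2_r, sigma2_u, sigma2_b) from pooled OLS residuals."""
    n = len(ols_resid)
    groups = {}
    for r, g in zip(ols_resid, teacher_of_student):
        groups.setdefault(g, []).append(r)
    n_teachers = len(groups)

    # Composite error variance: df-adjusted sum of squared residuals.
    sigma2_r = sum(r * r for r in ols_resid) / (n - n_covariates)

    # Student-level variance: within-teacher variation divided by N - G.
    within_ss = 0.0
    for rs in groups.values():
        mean_g = sum(rs) / len(rs)
        within_ss += sum((r - mean_g) ** 2 for r in rs)
    sigma2_u = within_ss / (n - n_teachers)

    # Teacher-level variance: the difference, truncated at zero if negative.
    sigma2_b = max(sigma2_r - sigma2_u, 0.0)
    return sigma2_r, sigma2_u, sigma2_b


teach = ["a", "a", "b", "b"]
spread = variance_components([2.0, 0.0, -2.0, 0.0], teach, n_covariates=1)
no_teacher_signal = variance_components([1.0, -1.0, 1.0, -1.0], teach, n_covariates=1)
```

In the first toy data set the teacher averages differ and the teacher variance estimate is positive (8/3 minus 2). In the second, all variation is within teacher, the raw difference is negative, and the estimate is set to zero.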

An appealing alternative is to estimate $\sigma_b^2$ and $\sigma_u^2$ jointly along with $\gamma$, using software that
ensures nonnegativity of the variance estimates. The most common approach to doing so is to
assume joint normality of the teacher effects, $b_g$, and the student effects, $u_{gi}$, across all $g$ and $i$,
along with the previous assumptions. One important point is that the resulting estimators are
consistent even without the normality assumption; so, technically, we can think of them as
“quasi-” maximum likelihood estimators. The maximum likelihood estimator of $\sigma_u^2$ has the same form as
in equation (20), except the residuals are based on the MLE of $\gamma$ rather than the OLS estimator. A
similar comment holds for the MLE of $\sigma_b^2$ (if we do not constrain it to be nonnegative). See, for
example, Hsiao (2003, Section 3.3.3).

Unlike the GLS estimator of $\gamma$, the feasible GLS (FGLS) estimator is no longer unbiased [even
under assumptions (5) and (7)], and so we must rely on asymptotic theory. In the current context,
the estimator is known to be consistent and asymptotically normal provided $G \to \infty$ with $N_g$
fixed. In simulations, Hansen (2007) shows that the asymptotic properties work well when $G$ is
around 40 with $N_g$ of a similar magnitude, and even somewhat larger. Consequently, the
asymptotic approximations for the FGLS estimator of $\gamma$ should be reliable in the vast majority
of VAM applications, which are typically applied at the district or state level with a large number
of teachers. In any case, our focus in this paper is not on estimation of $\gamma$ but rather the teacher
effects, $b$. Often we can estimate $b$ well, at least for ranking purposes, even when our estimator of
$\gamma$ is severely biased. For estimating $b$, the number of students per teacher is what matters most. In
fact, without $x_i$ in the equation, it is only the number of students per teacher that matters. In our
simulations we use relatively few teachers, 36, because adding more teachers does not change our
ability to estimate the effects for a particular teacher.
When $\gamma^*$ is replaced with the FGLS estimator and the variances $\sigma_b^2$ and $\sigma_u^2$ are replaced with
estimators, the EB estimator of $b$ is no longer a BLUP. Nevertheless, we use the same formula
as in (15) for operationalizing the BLUPs. Conveniently, certain statistical packages, such as
Stata 12 with its “xtmixed” command, allow one to recover the operationalized BLUPs after
maximum likelihood estimation. When we use the (quasi-) MLEs to obtain the $b_g^*$, we obtain what
are typically called the Empirical Bayes’ estimates.

One way to understand the shrinkage nature of $b^*$ is to compare it with the estimator obtained
by treating the teacher effects as fixed parameters. Let $\hat{\gamma}$ and $\hat{\beta}$ be the OLS estimators from the
regression

$$y_i \text{ on } x_i, z_i, \quad i = 1, \ldots, N. \tag{22}$$

Then $\hat{\gamma}$ is the so-called “fixed effects” (FE) estimator obtained by a regression of $y_i$ on the controls
in $x_i$ and the teacher assignment dummies in $z_i$. As with the “random effects” terminology, it is
important to understand that regression (22) incorporates teacher fixed effects, not student fixed
effects. In the context of the classical fixed parameters model

$$y = X\gamma + Z\beta + u, \tag{23}$$

$$E(u \mid X, Z) = 0, \quad \mathrm{Var}(u \mid X, Z) = \sigma_u^2 I_N,$$

$\hat{\gamma}$ is the BLUE of $\gamma$ and $\hat{\beta}$ is the BLUE of $\beta$. As is well known, $\hat{\gamma}$ can be obtained by an OLS
regression where $y_{gi}$ and $x_{gi}$ have been deviated from within-teacher averages (see, for example,
Wooldridge 2010, Chapter 10). Further, the estimated teacher fixed effects can be obtained as

$$\hat{\beta}_g = \bar{y}_g - \bar{x}_g \hat{\gamma}. \tag{24}$$

Equation (24) makes computation of the teacher VAMs efficient if one does not want to run the
long regression in (22).
By comparing equations (15) and (24), we see that the EB estimator $b_g^*$ differs from the fixed
effects estimator $\hat{\beta}_g$ in two ways. First, and most importantly, the RE estimator $\gamma^*$ is used in
computing $b_g^*$ while $\hat{\beta}_g$ uses the FE estimator $\hat{\gamma}$. Second, $b_g^*$ shrinks the average of the residuals
toward zero by the factor

$$\frac{\sigma_b^2}{\sigma_b^2 + (\sigma_u^2 / N_g)} = \frac{1}{1 + (\rho / N_g)}, \tag{25}$$

where

$$\rho = \frac{\sigma_u^2}{\sigma_b^2}. \tag{26}$$

Equation (25) illustrates the well-known result that the smaller the number of students taught by
teacher $g$, $N_g$, the more the average residual is shrunk toward zero.

A well-known algebraic result (see, for example, Wooldridge 2010, Chapter 10) that holds for
any given number of teachers $G$ is that¹

$$\gamma^* \to \hat{\gamma} \text{ as } \rho \to 0 \text{ or } N_g \to \infty. \tag{27}$$

Equation (27) can be verified by noting that the RE estimator of $\gamma$ can be obtained from the pooled
OLS regression

$$y_{gi} - \theta_g \bar{y}_g \text{ on } x_{gi} - \theta_g \bar{x}_g, \tag{28}$$

where

$$\theta_g = 1 - \left[ \frac{\sigma_u^2}{\sigma_u^2 + N_g \sigma_b^2} \right]^{1/2} = 1 - \left[ \frac{1}{1 + (N_g / \rho)} \right]^{1/2}. \tag{29}$$

It is easily seen that $\theta_g \to 1$ as $\rho \to 0$ or $N_g \to \infty$. In other words, with many students per teacher
or a large teacher effect variance relative to the student effect variance, the RE and FE estimates
can be very close, but never identical. Not coincidentally, the shrinkage factor in equation (25)
tends to unity if $\rho \to 0$ or $N_g \to \infty$. The bottom line is that with a “large” number of students
per teacher the shrinkage estimates of the teacher effects can be close to the fixed effects estimates.
The RE and FE estimates also tend to be similar when $\sigma_u^2$ (the student effect variance) is “small”
relative to $\sigma_b^2$ (the teacher effect variance), but this is an unlikely scenario.
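
The quasi-demeaning argument can be illustrated numerically. The sketch below is a simplification under stated assumptions: a single covariate, no intercept (purely to keep the arithmetic short), a known variance ratio $\rho$, and made-up data in which the teacher with the larger effect is assigned the high-$x$ students. The function name is hypothetical. The weight is written as $\theta_g = 1 - [\rho / (\rho + N_g)]^{1/2}$, which is algebraically the same as the expression in (29) but remains defined at $\rho = 0$.

```python
import math


def quasi_demeaned_slope(y, x, teacher_of_student, rho):
    """OLS slope of (y - theta_g * ybar_g) on (x - theta_g * xbar_g)."""
    groups = {}
    for yi, xi, g in zip(y, x, teacher_of_student):
        groups.setdefault(g, []).append((yi, xi))
    sxy = sxx = 0.0
    for rows in groups.values():
        n_g = len(rows)
        theta = 1.0 - math.sqrt(rho / (rho + n_g))
        ybar = sum(r[0] for r in rows) / n_g
        xbar = sum(r[1] for r in rows) / n_g
        for yi, xi in rows:
            xt = xi - theta * xbar
            sxy += xt * (yi - theta * ybar)
            sxx += xt * xt
    return sxy / sxx


# Made-up data: the within-teacher slope is 1; teacher "b" adds 3 to every score.
x = [0.0, 2.0, 4.0, 6.0]
y = [0.0, 2.0, 7.0, 9.0]
teach = ["a", "a", "b", "b"]
fe_slope = quasi_demeaned_slope(y, x, teach, rho=0.0)       # theta_g = 1
re_slope = quasi_demeaned_slope(y, x, teach, rho=1.0)       # 0 < theta_g < 1
pooled_slope = quasi_demeaned_slope(y, x, teach, rho=1e12)  # theta_g near 0
```

With these numbers the slope is 1 under full demeaning, roughly 1.47 at $\rho = 1$, and roughly 1.54 when the teacher structure is essentially ignored, so the RE estimate sits between the FE and pooled OLS estimates and moves toward FE as $\rho$ falls.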

An important point that appears to go unnoticed in applying the shrinkage approach is that, in
situations where $\gamma^*$ and $\hat{\gamma}$ substantively differ, $\gamma^*$ suffers from systematic bias because it assumes
teacher assignment is uncorrelated with $x_i$. Because $\gamma^*$ is used in constructing the $b_g^*$ in equation
(15), the bias in $\gamma^*$ generally results in biased teacher effects, which would be biased even if (15)
did not employ a shrinkage factor. The shrinkage likely exacerbates the problem: the estimates are
being shrunk toward values that are systematically biased for the teacher effects.²

The expressions in equations (15) and (24) motivate a common two-step alternative to the EB
and fixed effects approaches. In the first step of the procedure, one obtains $\tilde{\gamma}$ using the
OLS regression in equation (17), and obtains the residuals, $\tilde{r}_{gi}$. In the second step, one averages the
residuals $\tilde{r}_{gi}$ within each teacher to obtain the teacher effect for teacher $g$. These estimated teacher
effects can be expressed as

$$\tilde{\beta}_g = N_g^{-1} \sum_{i=1}^{N_g} \tilde{r}_{gi} = \bar{y}_g - \bar{x}_g \tilde{\gamma}, \tag{30}$$

which has the same form as (24) with the important difference that $\tilde{\gamma}$ is used in place of $\hat{\gamma}$. We call
this approach the “average residual” (AR) method. After obtaining the averages of the residuals
one can, in a third step, shrink the averages using the empirical Bayes’ shrinkage factors in equation
(15). This “shrunken average residual” (SAR) method typically obtains the shrinkage factors using
the estimates in equations (18) and (20).

With or without shrinking, the AR approach suffers from systematic bias if teacher assignment,
$z_i$, is correlated with the covariates, $x_i$. In effect, the AR approach partials $x_i$ out of $y_i$ but does not
partial $x_i$ out of $z_i$, the latter of which is crucial if $x_i$ and $z_i$ are correlated. The so-called “fixed
effects” regression in (22) partials $x_i$ out of $z_i$, which makes it a more reliable estimator under
nonrandom teacher assignment, and perhaps much more reliable with strong forms of nonrandom
assignment. Since the fixed effects estimation of the teacher VAMs allows any correlation between
$z_i$ and $x_i$, we expect it to outperform EB estimation and strongly outperform SAR under
nonrandom assignment. The bias due to nonrandom allocation of students to teachers is also
discussed in Rothstein (2009, 2010).
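
A tiny numerical example may help make the partialling-out point concrete. The sketch below is illustrative only: the four-student data set and the function names are hypothetical, there is a single covariate, and students with high $x$ are deliberately assigned to the teacher with the larger effect. It computes the fixed effects teacher estimates from the within-teacher slope and the average-residual estimates from a pooled regression of $y$ on a constant and $x$.

```python
def mean(values):
    return sum(values) / len(values)


def by_teacher(values, teach):
    groups = {}
    for v, g in zip(values, teach):
        groups.setdefault(g, []).append(v)
    return groups


def fe_teacher_effects(y, x, teach):
    """Within-teacher slope, then ybar_g - xbar_g * slope for each teacher."""
    ybar = {g: mean(v) for g, v in by_teacher(y, teach).items()}
    xbar = {g: mean(v) for g, v in by_teacher(x, teach).items()}
    sxy = sum((xi - xbar[g]) * (yi - ybar[g]) for yi, xi, g in zip(y, x, teach))
    sxx = sum((xi - xbar[g]) ** 2 for xi, g in zip(x, teach))
    slope = sxy / sxx
    return {g: ybar[g] - xbar[g] * slope for g in ybar}


def ar_teacher_effects(y, x, teach):
    """Pooled OLS of y on (1, x), then the average residual for each teacher."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return {g: mean(rs) for g, rs in by_teacher(resid, teach).items()}


# Made-up data: true slope on x is 1, teacher "b" adds 3, and "b" gets high-x students.
x = [0.0, 2.0, 4.0, 6.0]
y = [0.0, 2.0, 7.0, 9.0]
teach = ["a", "a", "b", "b"]
fe = fe_teacher_effects(y, x, teach)
ar = ar_teacher_effects(y, x, teach)
fe_gap = fe["b"] - fe["a"]  # about 3.0, the gap built into the data
ar_gap = ar["b"] - ar["a"]  # about 0.6, most of the gap is absorbed by the slope on x
```

In this constructed case the pooled slope is 1.6 rather than 1, so the average-residual method attributes most of the teacher difference to the covariate, while the fixed effects estimates recover the gap of 3.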

It is also important to know that the SAR approach is inferior to the EB approach under
nonrandom assignment. The logic is simple. First, the algebraic relationship between RE and FE means
that $\gamma^*$ tends to be closer to the FE estimator, $\hat{\gamma}$, than the OLS estimator, $\tilde{\gamma}$. Consequently, under
nonrandom teacher assignment, the estimated teacher effects using the RE estimator of $\gamma$ will have
less bias than the estimates that begin with OLS estimation of $\gamma$. Second, if teacher assignment is
uncorrelated with the covariates, the OLS estimator of $\gamma$ is inefficient relative to the RE estimator
under the standard random effects assumptions (because the RE estimator is FGLS). Thus, the only
possible justification for SAR is computational simplicity when the number of controls in $x_i$ is very
large. For the kinds of data sets widely available, the computational saving from using SAR rather
than EB is likely to be minor.

3 Summary of Estimation Methods 

In this paper we examine five different value-added estimators used to recover the teacher 
effects and apply them to both real and simulated data. Some of the estimators use EB or shrinkage 
techniques, while others do not. They can all be cast as a special case of the estimators described 
in the previous section. For clarity, we briefly describe each one, with additional reference to each 
of these specifications provided in Table A.1. Associated Stata 12 code is available upon request.

The estimators can be obtained from a dynamic equation of the form

$$A_{it} = \lambda A_{i,t-1} + X_{it}\delta + Z_{it}\beta + v_{it}, \tag{31}$$

in which $A_{it}$ is achievement (measured by a test score) for student $i$ in grade $t$, $X_{it}$ is a vector
of student characteristics, and $Z_{it}$ is the vector of teacher assignment dummies. This is similar
to equation (1) but with the lagged test score written separately from $X_{it}$ for clarity. Also, $X_{it}$
is omitted from the estimation of the teacher effects in the simulation analysis below, as student
characteristics are not included in the data generating process. The EB estimator we analyzed in
Section 2 was for the case of a single cross section of students, and so we only use one grade,
fifth grade, for the analysis.

We first analyze EB LAG, a dynamic MLE version of the EB estimator that treats the teacher
effects as random. This technique obtains the estimates of the teacher effects using normal
maximum likelihood in the first stage, where the error includes teacher random effects (along with
the student-specific error). In the second stage, the shrinkage factor is applied to these teacher
effects. As described in Rabe-Hesketh and Skrondal (2012), this two-step procedure can be
performed in one step using the “xtmixed” command in Stata 12 with teacher random effects. The
predicted random effects from this regression are identical to the MLE estimates shrunk by the
shrinkage factor. This procedure is generally justified even if the unobservables do not have
normal distributions, in which case we are applying quasi-MLE. A second estimator we consider is
the average residual (AR) method described in Section 2, which is obtained by first using the OLS
regression in (17) and then using (30). Recall that the AR method essentially differs from EB LAG
in that it uses OLS to estimate $\gamma$ in the first stage.³ We expect the EB LAG estimator to outperform
the AR estimator in most scenarios, given that EB LAG generally uses a more reliable estimator
of $\gamma$.

We compare the AR and EB estimators with estimators that partial out teacher assignment
when estimating $\gamma$, thereby allowing teacher assignment to be correlated with lagged test
scores and student characteristics. This third estimator is obtained by simply applying OLS to
(31), pooling across students and classrooms. We call this the “dynamic OLS” or “DOLS”
estimator. The inclusion of the lagged test score accounts for the possibility that teacher assignment
is related to students’ most recent test score. Guarino, Reckase, and Wooldridge (forthcoming)
discuss the assumptions under which DOLS consistently estimates $\beta$ when (31) is obtained from
a structural cumulative effects model, and the assumptions are quite restrictive. More importantly,
their simulations show the DOLS estimator often estimates $\beta$ well, at least for ranking purposes,
even when the assumptions underlying the consistency of DOLS fail.

Given that EB estimation is often motivated as a way to increase precision and decrease
misclassification, we also analyze whether shrinking AR and DOLS estimates enhances performance.
Thus, the fourth estimator we analyze is our shrunken average residual (SAR) estimator. This
estimator takes the AR estimates from (17) and shrinks them by the shrinkage factor described in
equation (25). Shrinking the AR estimates does not result in a true EB estimator since AR uses
OLS in the first stage, but it is commonly used as a simpler way of operationalizing the EB
approach (see, for example, Kane and Staiger, 2008). As discussed in Section 2, with a sufficiently
large number of students per teacher, the EB LAG estimator converges to the DOLS estimator,
but SAR does not. Thus, as the number of students per teacher grows, we would expect EB LAG
to perform more similarly to DOLS than SAR. Finally, we consider a shrunken DOLS (SDOLS)
estimator, which takes the DOLS estimated teacher fixed effects and shrinks them by the shrinkage
factor derived in Section 2. Although SDOLS is rarely done in practice and is not a true EB
estimator, we include it as an exploratory exercise in order to better determine the effects of shrinking
itself when the number of students per teacher differs. When the class sizes are all the same, the
shrunken estimates (SDOLS and SAR) will only differ from the unshrunken estimates by a
constant positive multiple. Thus, shrinking the DOLS or AR estimates will have no effect in terms of
ranking teachers. It is important to keep in mind that, unlike DOLS and SDOLS, the AR and SAR
estimators do not allow for general correlation between the teacher assignment and past test scores
(or other covariates).
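
The point about rankings can be checked with a short sketch. Everything in it is illustrative: the two unshrunken estimates, the class sizes, and the function names are hypothetical, and the variances are simply treated as known (0.0625 for the teacher effects and 1 for the student-level error).

```python
def shrink(estimates, class_sizes, sigma2_b, sigma2_u):
    """Multiply each estimate by sigma2_b / (sigma2_b + sigma2_u / N_g)."""
    return {
        g: est * sigma2_b / (sigma2_b + sigma2_u / class_sizes[g])
        for g, est in estimates.items()
    }


def ranking(estimates):
    """Teacher labels ordered from highest to lowest estimate."""
    return sorted(estimates, key=estimates.get, reverse=True)


raw = {"A": 0.30, "B": 0.25}  # made-up unshrunken estimates
equal = shrink(raw, {"A": 20, "B": 20}, sigma2_b=0.0625, sigma2_u=1.0)
unequal = shrink(raw, {"A": 5, "B": 25}, sigma2_b=0.0625, sigma2_u=1.0)
```

With equal class sizes both estimates are multiplied by the same factor and teacher A stays ahead of teacher B. With 5 students for A and 25 for B, A is shrunk much more heavily and the order reverses, which is the only channel through which shrinking alone can change a ranking.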
4 Comparing VAM Methods Using Simulated Data


The question of which VAM estimators perform the best can only be addressed in simulations 
where the true teacher effects are known. Therefore, to evaluate the performance of EB estimators 
relative to other common value-added estimators, we apply these methods to simulated data. This 
approach allows us to examine how well various estimators recover the true teacher effects under a 
variety of student grouping and teacher assignment scenarios. Using data generated as described in 
Section 4.1, we apply the value-added estimators discussed in Section 3 and compare the resulting 
estimates with the true teacher effects. 

4.1 Simulation Design 

Much of our main analysis focuses on a base case that restricts the data generating process to a
relatively narrow set of idealized conditions (e.g., no measurement error, no peer effects, constant
teacher effects). However, we do relax some of these conditions as sensitivity tests of the main
results. The data are constructed to represent grades three through five (the tested grades) in a
hypothetical school. For simplicity and comparison with the theoretical predictions, we assume
that the learning process has been going on for a few years but only calculate estimates of teacher
effects for fifth grade teachers, a single cross section.⁴ We create data sets that contain students
nested within teachers, with students followed longitudinally over time in order to reflect the
institutional structure of an elementary school. Our simple baseline data generating process is as
follows:

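$$
\begin{aligned}
A_{i3} &= \lambda A_{i2} + \beta_{i3} + c_i + e_{i3} \\
A_{i4} &= \lambda A_{i3} + \beta_{i4} + c_i + e_{i4} \\
A_{i5} &= \lambda A_{i4} + \beta_{i5} + c_i + e_{i5}
\end{aligned}
\tag{32}
$$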
in which $A_{i2}$ is a baseline score reflecting the subject-specific knowledge of child $i$ entering third 
grade; $A_{it}$ is the grade-$t$ test score ($t = 3, 4, 5$); $\lambda$ is a time-constant decay parameter, where $\lambda$ 
is set equal to zero in the simulations for scores lagged more than one year; $\beta_{it}$ is the 
teacher-specific contribution to growth (the true teacher value-added effect); $c_i$ is a time-invariant 
student-specific effect (may be thought of as “ability” or “motivation”); and $e_{it}$ is a random deviation for 
each student. We assume independence of $e_{it}$ over time, a restriction implying that past shocks 
to student learning decay at the same rate as all inputs (see Guarino, Reckase, and Wooldridge, 
forthcoming, for a more detailed discussion of this “common factor restriction” assumption). In 
all of the simulations reported in this paper, the random variables $A_{i2}$, $\beta_{it}$, $c_i$, and $e_{it}$ are drawn 
from normal distributions with zero means. The standard deviation of the teacher effect is .25, the 
standard deviation of the student fixed effect is .5, and the standard deviation of the random noise 
component is 1. These give relative shares of 5, 19, and 76 percent of the total variance in gain 
scores (when $\lambda = 1$), respectively. Given that the student and noise components are larger than 
the teacher effects, we call these “small” teacher effects. We also conduct a sensitivity analysis 
using “large” teacher effects, where the true teacher effects are drawn from a distribution with a 
standard deviation of 1. The baseline score is drawn from a distribution with a standard deviation 
of 1. We also allow for correlation between the time-invariant child-specific heterogeneity, $c_i$, and 
the baseline test score, $A_{i2}$, which we set to 0.5. This correlation reflects that students with better 
unobserved “ability” likely have higher test scores as well. 
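As a quick check of the variance shares quoted above, the following sketch recomputes them from the stated standard deviations. It assumes the three components are mutually independent, so that the variance of the gain score (with $\lambda = 1$) is the sum of the three variances; the variable names are illustrative only.

```python
# Standard deviations stated in the text for the "small" teacher-effect case.
sd_teacher, sd_student, sd_noise = 0.25, 0.5, 1.0

# With lambda = 1 the gain score is beta + c + e. Assuming the components are
# independent, its variance is the sum of the component variances.
variances = {
    "teacher": sd_teacher ** 2,
    "student": sd_student ** 2,
    "noise": sd_noise ** 2,
}
total = sum(variances.values())
shares = {name: 100 * v / total for name, v in variances.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
# teacher: 4.8%
# student: 19.0%
# noise: 76.2%
```

Rounded to whole percentages, these match the 5, 19, and 76 percent reported in the text.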

Our data are simulated using 36 teachers per grade and 720 students per cohort. For estimating 
teacher effects, it is the number of students per teacher that matters. The number of teachers 
only affects the precision of the estimates of $\lambda$ and the population variances and, as discussed 
earlier, results in Hansen (2007) indicate that 36 teachers is sufficient. In order to create a situation 
in which there is a substantial variation in class size - to showcase the potential disparities between 
EB/shrinkage and other estimators - we vary the number of students per classroom. Teachers 
receive classes of varying sizes, but receive the same number of students in each cohort. The 
size of the class each teacher receives is random, subject to the constraint that twelve teachers have 
classes of 10 students, twelve have classes of 20, and twelve have classes of 30. We 
simulate the data using both one and four cohorts of students to provide further variation in the 
amount of data from which the teacher effects are calculated. In the case of four cohorts, data are 
pooled across the cohorts so that value-added estimates are based on sample sizes of 40, 80, and 
120, instead of 10, 20, and 30 as in the one cohort case. Therefore, we would expect estimates 
in the four cohort case to be less noisy than those from the one cohort case, possibly mitigating the 
potential gains from EB estimation. 
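The baseline data generating process can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' simulation code: it covers only the random grouping, random assignment case for one cohort, it redraws teacher effects on every call (the paper holds them fixed across replications), and the function name, argument names, and return structure are all hypothetical.

```python
import random


def simulate_cohort(lam=0.5, sd_teacher=0.25, sd_student=0.5, sd_noise=1.0,
                    corr=0.5, seed=0):
    """Simulate grades 3-5 for one cohort under random grouping and assignment.

    Returns {grade: {"lagged", "score", "teacher", "beta"}} following the
    structure of equation (32). Illustrative sketch only.
    """
    rng = random.Random(seed)
    class_sizes = [10] * 12 + [20] * 12 + [30] * 12  # 36 teachers, 720 students
    n = sum(class_sizes)

    # Baseline score A_i2 (SD 1) and student effect c_i (SD sd_student),
    # constructed so that corr(A_i2, c_i) equals `corr`.
    baseline, c = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        baseline.append(z1)
        c.append(sd_student * (corr * z1 + (1 - corr ** 2) ** 0.5 * z2))

    out, lagged = {}, baseline
    for grade in (3, 4, 5):
        beta = [rng.gauss(0, sd_teacher) for _ in class_sizes]
        # Random grouping: shuffle students, then cut the list into classes.
        order = list(range(n))
        rng.shuffle(order)
        teacher = [0] * n
        start = 0
        for j, size in enumerate(class_sizes):
            for i in order[start:start + size]:
                teacher[i] = j
            start += size
        # Only the one-year lag enters; earlier scores have coefficient zero.
        score = [lam * lagged[i] + beta[teacher[i]] + c[i] + rng.gauss(0, sd_noise)
                 for i in range(n)]
        out[grade] = {"lagged": lagged, "score": score,
                      "teacher": teacher, "beta": beta}
        lagged = score
    return out


data = simulate_cohort()
print(len(data[5]["score"]), len(data[5]["beta"]))  # 720 36
```

Pooling four cohorts would amount to calling such a function four times with the same teacher effects and stacking the fifth grade records by teacher.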

To create different scenarios, we vary two key features: the grouping of students into classes 
and the assignment of classes of students to teachers within schools. We generate data using each 
of the seven different mechanisms for the assignment of students outlined in Table A.2. Students 
are grouped into classrooms either randomly, based on their prior year achievement level (dynamic 
grouping or DG), or based on their unobserved heterogeneity (heterogeneity grouping or HG). In 
the random case, students are assigned a random number and then grouped into classrooms of 
various sizes based on that random number. In the grouping cases, students are ranked by either 
the prior test score or their fixed effect and grouped into classrooms of various sizes based on that 
ranking. Teachers are assigned to these classrooms either randomly (denoted RA) or nonrandomly. 
Teachers assigned nonrandomly can be assigned positively (denoted PA), meaning the best teachers 
are assigned to classrooms with the best students, or negatively (denoted NA), meaning the best 
teachers are assigned to classrooms with the worst students. 

These grouping and assignment procedures are not purely deterministic, as we allow for a ran- 
dom component with standard deviation of 1 in the assignment mechanism. As a sensitivity analy- 
sis, we also set this standard deviation to 0.1, meaning the grouping of students into classrooms is 
more deterministic. We use the estimators discussed in Section 3, but with only a constant, teacher 
dummies (if applicable), and the lagged test score included as covariates. We use 100 Monte Carlo 
replications per scenario in evaluating each estimator. 
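The grouping and assignment step described above might look roughly like the following. This is a hedged sketch: the paper does not spell out exactly where the random component enters, so here it is assumed to be added to the grouping variable (prior score for DG, student effect for HG) before students are ranked. The function name and arguments are hypothetical.

```python
import random


def group_and_assign(prior, teacher_effects, class_sizes,
                     assignment="PA", noise_sd=1.0, seed=0):
    """Return the teacher index of each student.

    prior: grouping variable per student (lagged score for DG, c_i for HG).
    class_sizes[j]: number of students teacher j receives.
    assignment: "PA" (best teachers get best students), "NA", or "RA".
    """
    rng = random.Random(seed)
    n_students, n_teachers = len(prior), len(teacher_effects)
    if sum(class_sizes) != n_students or len(class_sizes) != n_teachers:
        raise ValueError("class sizes must match students and teachers")

    # Noisy ranking of students (assumption: noise is added to the grouping
    # variable). A smaller noise_sd makes the grouping more deterministic.
    noisy = [p + rng.gauss(0, noise_sd) for p in prior]
    student_order = sorted(range(n_students), key=noisy.__getitem__)

    # Order teachers from worst to best true effect, then adjust by scenario.
    teacher_order = sorted(range(n_teachers), key=teacher_effects.__getitem__)
    if assignment == "NA":
        teacher_order.reverse()
    elif assignment == "RA":
        rng.shuffle(teacher_order)
    elif assignment != "PA":
        raise ValueError("assignment must be 'PA', 'NA', or 'RA'")

    # Hand consecutive blocks of ranked students to teachers in that order.
    teacher_of = [0] * n_students
    start = 0
    for j in teacher_order:
        for i in student_order[start:start + class_sizes[j]]:
            teacher_of[i] = j
        start += class_sizes[j]
    return teacher_of
```

Passing the lagged test score as `prior` with `assignment="PA"` corresponds to the DG-PA scenario; passing the student effect instead corresponds to HG-PA.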
4.2 Evaluation Measures 


For each estimator across each iteration, we save the individual estimated teacher effects and 
also retain the true teacher effects, which are fixed across the iterations for each teacher. To study 
how well the methods recover the true teacher effects, we adopt five simple summary measures 
using the teacher level data. The first is a measure of how well the estimates preserve the rankings 
of the true teacher effects. We compute the Spearman rank correlation, $\rho$, between the estimated 
teacher effects and the true effects and report the average $\rho$ across the 100 iterations. Second, we 
compute a measure of misclassification. These misclassification rates are obtained as the percentage 
of above average teachers in the true quality distribution (i.e., teachers with true $\beta_g > 0$) who 
are misclassified as below average in the distribution of estimated teacher effects for the given esti- 
mator. Given that this is just an arbitrary cutoff point, we also obtain the fraction of teachers in the 
outside tails of the distribution that are incorrectly classified (e.g., fraction of teachers that are in 
the bottom quintile of the true distribution, but estimated to lie in one of the other four quintiles). 
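The ranking measures can be computed as in the sketch below. It makes two assumptions that the text does not pin down: "below average" in the estimated distribution is taken to mean below the mean of the estimates, and the bottom quintile is taken to be the lowest `len(true) // 5` teachers. The helper names are hypothetical.

```python
def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    k = 0
    while k < len(order):
        m = k
        while m + 1 < len(order) and values[order[m + 1]] == values[order[k]]:
            m += 1
        for idx in order[k:m + 1]:
            out[idx] = (k + m) / 2 + 1
        k = m + 1
    return out


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(est, true):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(est), ranks(true))


def misclassified_above_average(est, true):
    """Share of teachers with true effect > 0 estimated below the mean estimate."""
    est_mean = sum(est) / len(est)
    above = [g for g, b in enumerate(true) if b > 0]
    return sum(est[g] < est_mean for g in above) / len(above)


def misclassified_bottom_quintile(est, true):
    """Share of true bottom-quintile teachers not estimated in the bottom quintile."""
    k = max(1, len(true) // 5)
    true_bottom = set(sorted(range(len(true)), key=true.__getitem__)[:k])
    est_bottom = set(sorted(range(len(est)), key=est.__getitem__)[:k])
    return len(true_bottom - est_bottom) / k


true = [-2.0, -1.0, 0.5, 1.0, 2.0]
est = [-1.0, -2.0, -0.5, 1.5, 1.0]
print(round(spearman(est, true), 3))  # 0.8
```

In the simulations, each measure would be computed once per replication and then averaged over the 100 replications.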

In addition to examining rank correlations and misclassification rates, it is also helpful to have a 
measure that quantifies some notion of the magnitude of the bias in the estimates. Given that some 
teacher effects are biased upwards and others downwards, it is difficult to capture the overall bias 
in the estimates in a simple way. For each simulation, we create a statistic, $\hat{\theta}$, that captures how 
closely the magnitude of the deviation of the estimates from their mean tracks the magnitude of the 
deviation of the true effects from the true mean. To calculate this measure, we regress the deviation 
of the estimated teacher effects from their overall estimated means on the analogous deviation of 
the true effects generated from the simulation - for each estimator. We can represent this simple 
regression as 

$$\hat{\beta}_g - \bar{\hat{\beta}} = \theta\,(\beta_g - \bar{\beta}) + \text{residual}_g, \tag{33}$$

in which $\hat{\beta}_g$ is the estimated teacher effect and $\beta_g$ is the true effect of teacher $g$. From this 
simple regression, we report the average coefficient, $\hat{\theta}$, across the 100 replications of the simulation 
for each estimator. This regression tells us whether the estimated teacher effects are correctly 
distributed around the average teacher. If $\hat{\theta} = 1$, then a movement of $\beta_g$ away from its mean is tracked 
by the same movement of $\hat{\beta}_g$ from its mean. Because the estimated teacher effects are deviated 
from the overall mean of the estimated effects, $\hat{\theta}$ will not pick up additive bias that affects each 
teacher effect in the same way. However, one is not typically concerned about such biases if the 
estimated effects are used for comparisons among teachers. 

When $\hat{\theta} \approx 1$, it makes sense to compare the magnitudes of the estimated teacher effects across 
teachers. If $\hat{\theta} > 1$, the estimated teacher effects amplify the true teacher effects. In other words, 
teachers above average will be estimated to be even more above average and vice versa for below 
average teachers. An estimation method that produces $\hat{\theta}$ substantially above one generally does 
a good job of ranking teachers, but the magnitudes of the differences in estimated teacher effects 
cannot be trusted. The magnitudes also cannot be trusted if $\hat{\theta} < 1$, and ranking the teachers 
generally becomes more difficult since the estimated effects are compressed relative to the true 
teacher effects. In some policy applications, the relative magnitudes of the estimated teacher effects 
might be important, and so we report the average value of $\hat{\theta}$ across the simulations. Doing so 
allows us to determine scenarios where the magnitudes of estimated teacher effects are meaningful. 
Further, the measure provides insight into why some methods rank teachers relatively well even 
when the estimated effects are systematically biased, often quite badly. 
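The slope in equation (33) is an ordinary least squares coefficient from a regression of demeaned estimates on demeaned true effects, which can be sketched as follows. The function name is hypothetical; the example inputs are made up to show that a common additive shift leaves the slope unchanged while compression pulls it below one.

```python
def theta_hat(est, true):
    """OLS slope from regressing (est - mean(est)) on (true - mean(true))."""
    n = len(true)
    est_bar, true_bar = sum(est) / n, sum(true) / n
    num = sum((t - true_bar) * (e - est_bar) for e, t in zip(est, true))
    den = sum((t - true_bar) ** 2 for t in true)
    return num / den


true = [-0.4, -0.1, 0.0, 0.2, 0.3]

# Estimates compressed by half and shifted up by 3: the shift is absorbed by
# demeaning, so only the compression shows up in the slope.
compressed = [0.5 * b + 3.0 for b in true]
amplified = [1.5 * b for b in true]

print(round(theta_hat(compressed, true), 6))  # 0.5
print(round(theta_hat(amplified, true), 6))   # 1.5
```

Both example estimators rank the five teachers perfectly, which illustrates why the slope is reported alongside, not instead of, the rank correlation.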

The precision of these methods is also a key consideration when evaluating the overall perfor- 
mance. As described in Section 2, EB methods are not unbiased when the teacher effects are treated 
as fixed parameters we are trying to estimate. However, if the identifying assumptions hold, these 
methods should provide more precise estimates. This is one motivation for using EB methods, as 
estimates should be more stable over time, leading to a smaller variance in the teacher effects. As 
the teacher effect is fixed for each teacher across the 100 iterations, we have 100 estimates of each 
teacher effect. As a summary measure for the precision of the estimators, we calculate the standard 
deviation of the 100 teacher effect estimates for each teacher and then take a simple average across 
all teachers. 


Finally, to further analyze the variance-bias tradeoff for each of these estimators, we also 
include the average mean squared error (MSE). This measure averages the following across all 
teachers $g$ and across simulation runs: 

$$\text{MSE}_g = (\hat{\beta}_g - \beta_g)^2 \tag{34}$$

This provides a simple statistic to determine whether the bias induced by shrinking is justifiable 
due to gains in precision. 
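The precision and MSE summaries can be sketched as below. The sketch assumes the estimates are stored as one list per replication and uses the population standard deviation across replications, so that the average MSE splits exactly into average variance plus average squared bias; whether the paper uses the sample or population formula is not stated. All names are hypothetical.

```python
from statistics import fmean, pstdev


def precision_and_mse(estimates, true):
    """Summaries over replications.

    estimates[r][g]: estimate for teacher g in replication r.
    true[g]: true effect of teacher g (fixed across replications).
    Returns (average SD, average MSE, average squared bias, average variance).
    """
    by_teacher = [[rep[g] for rep in estimates] for g in range(len(true))]
    avg_sd = fmean([pstdev(draws) for draws in by_teacher])
    mse = fmean([(e - true[g]) ** 2
                 for g, draws in enumerate(by_teacher) for e in draws])
    avg_sq_bias = fmean([(fmean(draws) - true[g]) ** 2
                         for g, draws in enumerate(by_teacher)])
    avg_var = fmean([pstdev(draws) ** 2 for draws in by_teacher])
    return avg_sd, mse, avg_sq_bias, avg_var


# Two replications, two teachers (made-up numbers).
estimates = [[0.1, 1.0], [0.3, 1.4]]
true = [0.0, 1.0]
avg_sd, mse, avg_sq_bias, avg_var = precision_and_mse(estimates, true)
print(round(mse, 3), round(avg_var + avg_sq_bias, 3))  # 0.065 0.065
```

The decomposition makes the tradeoff explicit: shrinkage lowers the variance term and raises the squared bias term, and the MSE reports the net effect.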

5 Simulation Results 

Tables 1 and 2 report the five evaluation measures described in Section 4.2 for each particu- 
lar estimator-assignment scenario combination. For ease in interpreting the tables, a quick guide 
to the descriptions of each of these estimators, grouping-assignment mechanisms, and evaluation 
measures can be found in Appendix Tables A.1 through A.3. As these shrinkage and EB estimators 
are often motivated as a way to reduce noise, one might expect these approaches to be most bene- 
ficial with very limited student data per teacher. Thus, we estimate teacher effects using both four 
cohorts and one cohort of data. The tables show results for the case $\lambda = 0.5$. Though not reported, 
we also conducted a full set of simulations for $\lambda = 0.75$ and $\lambda = 1$, and the main conclusions are 
unchanged. The full set of simulation results is available upon request from the authors. 

5.1 Fixed Teacher Effects versus Random Teacher Effects 

In Table 1, we first compare the performance of the DOLS estimator, which treats teacher 
effects as fixed parameters to estimate, to the AR and EB LAG estimators that treat teacher effects 
as random. Under nonrandom assignment of teachers, we expect DOLS, which explicitly controls 
for teacher assignment through the inclusion of teacher assignment indicators, to perform better 
than those estimators treating the teacher effects as random. When teacher assignment is based on 
the lagged test score, DOLS directly controls for the assignment mechanism by including both the 
lagged score and teacher dummies and should perform well in this case. The simulation results 
presented here largely support this hypothesis. 
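To see how DOLS conditions on both the lagged score and the teacher dummies, the regression can be computed without any matrix algebra. The sketch below uses a full set of teacher dummies and no constant, which is an assumption: it differs from a specification with a constant and one omitted teacher only by a common additive normalization and gives the same rankings. The function name is hypothetical.

```python
from statistics import fmean


def dols(score, lagged, teacher):
    """OLS of score on the lagged score plus a full set of teacher dummies.

    By the Frisch-Waugh result, the slope on the lagged score equals the
    within-teacher (demeaned) slope, and each teacher coefficient is that
    teacher's mean score minus the slope times the mean lagged score.
    """
    groups = {}
    for i, j in enumerate(teacher):
        groups.setdefault(j, []).append(i)
    y_bar = {j: fmean([score[i] for i in idx]) for j, idx in groups.items()}
    x_bar = {j: fmean([lagged[i] for i in idx]) for j, idx in groups.items()}

    sxy = sum((lagged[i] - x_bar[j]) * (score[i] - y_bar[j])
              for i, j in enumerate(teacher))
    sxx = sum((lagged[i] - x_bar[j]) ** 2 for i, j in enumerate(teacher))
    lam_hat = sxy / sxx
    effects = {j: y_bar[j] - lam_hat * x_bar[j] for j in groups}
    return lam_hat, effects


# Noise-free toy data: score = 0.5 * lagged + teacher effect.
lagged = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
teacher = [0, 0, 0, 1, 1, 1]
beta = {0: -0.2, 1: 0.3}
score = [0.5 * x + beta[j] for x, j in zip(lagged, teacher)]
lam_hat, effects = dols(score, lagged, teacher)
print(round(lam_hat, 6), {j: round(b, 6) for j, b in effects.items()})
```

Because the slope is identified only from within-teacher variation, correlation between teacher assignment and the lagged score does not contaminate the teacher coefficients, which is the property the text attributes to DOLS.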

5.1.1 Random Assignment 

We begin with the pure random assignment (RA) case (i.e., the case of no teacher sorting), 
where EB-type estimation methods are theoretically justified. The results of the random assign- 
ment case are given in the top panel of Table 1, and they suggest very little substantial difference 
between the performance of the fixed and random effects estimators under this scenario. As the 
theory suggests, EB LAG performs well in the four cohort case, with rank correlations between 
the estimated and the true teacher effects near 0.86, which is nearly the same as the 0.85 rank 
correlation for DOLS and AR. In addition to very similar rank correlations, the misclassification 
rates are very similar across the three estimators, with about 15 percent of above average teachers 
misclassified as below average. These estimators also misclassify 28 percent of the teachers that 
should be classified in the bottom quintile. The similarities between the three estimators in terms 
of rank correlation and misclassification rates remain when using only one cohort. Reducing the 
amount of data used to estimate the teacher effects lowers the performance of all estimators, de- 
creasing the rank correlations and increasing the misclassification rates. With one cohort, rank 
correlations between the estimated and true teacher effects are about 0.65 to 0.67, and between 25 
and 26 percent of above average teachers are misclassified as below average. 

In addition to rank correlations and misclassification rates, we also examine the bias and 
precision of the estimators. While DOLS and AR appear to be unbiased with average $\hat{\theta}$ values close to 
1, EB LAG substantially underestimates the magnitudes of the true teacher effects with an average 
$\hat{\theta}$ value of 0.78 using four cohorts and 0.49 using one cohort. This bias is likely the result of the 
shrinkage technique that is applied, but this shrinkage does cause EB LAG to be slightly more 
precise than AR or DOLS. While DOLS and AR both have similar average standard deviations 
of the estimated teacher effects near 0.13 and 0.27 in the four and one cohort cases, respectively, 
EB LAG has lower average standard deviations of 0.12 and 0.18, respectively. Given the precision 
gain in EB LAG, the MSE measure suggests that EB LAG may be preferred to DOLS or AR under 
random assignment. 

We now move to the cases where the students are nonrandomly grouped together, but teachers 
are still randomly assigned to classrooms. We allow for nonrandom grouping based on either 
the prior year test score (dynamic grouping, DG) or student-level heterogeneity (heterogeneity 
grouping, HG). Under these DG-RA and HG-RA scenarios in Table 1, we see a fairly similar 
pattern as in the RA scenario, although the overall performance of all estimators is somewhat 
diminished, especially in the HG-RA scenario. 

5.1.2 Dynamic Grouping and Nonrandom Assignment 

The performance of the various estimators diverges noticeably under nonrandom teacher 
assignment. We continue to nonrandomly group students as described above, but now allow for 
nonrandom assignment of teachers to classrooms. Classes with high test scores or high unobserved 
ability can be assigned to either the best (positive assignment - PA) or worst (negative assignment 
- NA) teachers. A key finding of this analysis is the disparity in performance between the DOLS 
estimator and estimators that fail to allow for correlation between the teacher assignment and the 
assignment mechanism (e.g., AR and EB LAG). These results suggest that, when there is non- 
random teacher assignment based on the prior test score, estimators explicitly controlling for the 
teacher assignment should be preferred. 

Similar results hold for both DG-PA and DG-NA, so we focus only on the DG-PA results here. 
Under the DG-PA scenario, DOLS substantially outperforms AR and EB LAG. When using four 
cohorts, DOLS has a rank correlation of 0.86 under DG-PA, while AR and EB LAG have rank 
correlations of 0.60 and 0.76, respectively. AR and EB LAG also have large misclassification 
rates, with 28 to 32 percent of above average teachers being misclassified as below average com- 
pared with only 23 percent for DOLS. Although not listed in the table, DOLS also misclassifies 
fewer teachers in the bottom quintile - DOLS only misclassifies 28 percent of the teachers that 
should be classified in the bottom quintile, while EB LAG and AR misclassify 39 and 49 percent, 
respectively. 

In addition to misclassifying and poorly ranking teachers, the AR and EB LAG methods also 
underestimate the magnitudes of the true teacher effects. While DOLS has an average $\hat{\theta}$ value of 
0.99, the AR and EB LAG estimators have average $\hat{\theta}$ values of 0.53 and 0.49, respectively. While 
some of the bias of the EB LAG estimates can be attributed to shrinkage, the larger issue is the 
bias caused by the failure of the AR and EB LAG approaches to allow for correlation between the 
lagged test score (i.e., the assignment mechanism in these scenarios) and the teacher assignment, a 
correlation that DOLS explicitly allows for with the inclusion of teacher dummies in the regression. 
Which estimator is preferred based on MSE differs depending on the number of cohorts. In the four 
cohort case, DOLS, EB LAG and AR have MSE values of 0.018, 0.037, and 0.024, respectively. 
When only one cohort is used, EB LAG has the smallest MSE of 0.051, while DOLS and AR have 
MSE values of 0.074 and 0.091, respectively. Despite the gain in precision, the average bias across 
all teachers (based on $\hat{\beta}_g - \beta_g$) and simulation replications in EB LAG is nearly three times that of DOLS. 
Given the poor teacher rankings for EB LAG in this case, the consequences of the extreme bias 
cannot be ignored, even if the MSE measure suggests it should be preferred. 

These simulation results also verify an important result of the theoretical discussion: the per- 
formance of EB LAG approaches the performance of DOLS as the number of students per teacher 
grows. We see less of a disparity in the performance of DOLS and EB LAG when computing 
VAMs using four cohorts compared to one, but the relative performance of AR does not improve 
with more students per teacher. For example, under DG-PA with one cohort of students, AR and 
EB LAG have rank correlations of 0.38 and 0.45, respectively, compared to 0.63 for DOLS. With 
four cohorts of students, the rank correlation for EB LAG is much closer to that for DOLS (0.76 
and 0.86, respectively) than is the rank correlation for AR (0.60). This theoretical result is also ap- 
plicable to the SAR estimator we examine below, which is used as a simpler way to operationalize 
the EB approach. In summary, EB LAG, which uses random effects estimation in the first stage, is 
preferred to AR under nonrandom teacher assignment, as the EB estimates approach the preferred 
DOLS estimates that treat teacher effects as fixed. 

5.1.3 Heterogeneity Grouping and Nonrandom Assignment 

As a final scenario we examine the case of nonrandom teacher assignment to students grouped 
on the basis of student-level heterogeneity. The results for these HG scenarios are especially 
unstable: all estimators do an excellent job ranking teachers under positive teacher assignment 
and a very poor job under negative teacher assignment. In the HG-PA case with four cohorts 
of students, the magnitudes of the estimated VAMs are amplified as seen by the large average 
values for $\hat{\theta}$ between 1.43 and 1.61. This improves the ability of the various estimators to rank 
teachers as evidenced by the high rank correlations of about 0.94 for all estimators. The EB LAG 
estimator performs the best in this scenario, as it performs as well as the other estimators in terms 
of ranking and misclassification of teachers but has the smallest MSE measure. Under HG-NA 
with four cohorts, the performance of all estimators falls substantially, largely caused by severely 
underestimated teacher effects ($\hat{\theta}$ values between 0.15 and 0.33). These compressed teacher effect 
estimates make it difficult to rank teachers in this scenario, resulting in low rank correlations for 
all estimators between 0.38 and 0.41. Just as in the HG-PA scenario, the performance of the three 
estimators under HG-NA is very similar across the evaluation measures we examine. 

Why is the performance of DOLS, AR, and EB LAG so similar under HG-PA and HG-NA, 
while differing so greatly under DG-PA and DG-NA? Despite correlation between the baseline 
test score and the student fixed effect, the lagged test score appears to be a weak proxy for the 
assignment mechanism in the HG scenarios. Since none of the three estimators do well at allowing 
for the correlation between the assignment mechanism and the teacher assignment in these cases, 
the distinction between estimators that include teacher fixed effects and those that treat teacher 
effects as random is less stark. As found in Guarino, Reckase, and Wooldridge (forthcoming), a 
gain score estimator with student fixed effects included is the most robust in these HG scenarios, 
as it does allow for the correlation between the assignment mechanism (i.e., student fixed effect) 
and the teacher assignment (i.e., teacher dummy variables). Their results lend further support to the 
conclusion that allowing for this correlation is extremely important in the performance of these 
value added estimators when there is nonrandom assignment. 

5.2 Shrinkage versus Non-Shrinkage Estimation 

Use of EB and other shrinkage estimators is often motivated as a way to reduce the noise in 
the estimation of teacher effects, particularly for teachers with a small number of students. Greater 
stability in the estimated effects is thought to reduce misclassification of teachers. We observed 
in Section 5.1 that EB LAG was generally outperformed by the fixed effects estimator, DOLS. 
However, under nonrandom teacher assignment, we are unable to tell how much of the bias in the 
EB LAG estimator is due to treating the teacher effects as random and how much is due to the 
shrinkage procedure. To examine the effects of shrinkage itself, we compare the performance of 
unshrunken estimators, DOLS and AR, with their shrunken versions, SDOLS and SAR, in Table 
2. Although SDOLS is not a commonly used or theoretically justified estimator, we explore it here 
to identify whether shrinking teacher fixed effect estimates could be useful in practice. 
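To make the shrinkage step concrete, the sketch below applies a textbook shrinkage (reliability) factor, the ratio of the teacher-effect variance to that variance plus the sampling variance of a class mean. This form is an assumption used for illustration; the shrinkage factors actually applied in SDOLS and SAR are those defined in Section 3, and in practice the variance components are estimated rather than known. The function name and example numbers are hypothetical.

```python
def shrink(estimate, n_students, var_teacher, var_noise):
    """Multiply an estimate by var_teacher / (var_teacher + var_noise / n)."""
    reliability = var_teacher / (var_teacher + var_noise / n_students)
    return reliability * estimate


var_teacher = 0.25 ** 2  # "small" teacher effects from the simulation design
var_noise = 1.0          # assumed student-level noise variance

# Equal class sizes: every estimate is multiplied by the same positive
# constant, so the ranking of teachers cannot change.
a_equal = shrink(0.30, 20, var_teacher, var_noise)
b_equal = shrink(0.25, 20, var_teacher, var_noise)

# Unequal class sizes: the teacher with fewer students is shrunk harder, so
# a ranking can in principle reverse.
a_small = shrink(0.30, 10, var_teacher, var_noise)
b_large = shrink(0.25, 30, var_teacher, var_noise)

print(a_equal > b_equal, a_small > b_large)  # True False
```

Whether such reversals occur often enough to change rank correlations or misclassification rates is an empirical question, which is what the comparison in Table 2 addresses.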

Our simulation results show that there is no substantial improvement in the performance of the 
DOLS or AR estimators after applying the shrinkage factor to the estimates. Using four cohorts 
of students, the performance measures for DOLS and AR compared to their shrunken counterparts 
are nearly identical to two decimal places across all grouping and assignment scenarios. Even with 
very limited data per teacher in the one cohort case, when we would expect shrinkage to have a 
greater effect on the estimates, we find very little change in the performance of the estimators after 
the shrinkage factor is applied. 

In the one cohort case, shrinking either the DOLS or AR estimates slightly decreases (in the 
second decimal place) both the average $\hat{\theta}$ values and average standard deviation of the estimated 
teacher effects. This increased bias in the estimates is expected when applying the shrinkage 
factor and, depending on the scenario and estimator we examine, the effect of this precision-bias 
tradeoff may increase or decrease the MSE measure when comparing the shrunken and unshrunken 
estimates. Shrinking the DOLS and AR estimates generally reduces the MSE, due to increased 
precision, but makes no substantial difference on the misclassification rate of teachers, regardless 
of which misclassification rate we use. 

Shrinkage itself does not appear to be practically important for properly ranking teachers, 
nor does it ameliorate the poor performance of the biased AR estimator in the DG-PA and 
DG-NA scenarios. Given that shrinking the AR estimates does little to mitigate the performance 
drop of AR under DG-PA and DG-NA, our evidence suggests that shrinking the DOLS estimates 
is preferable to shrinking the AR estimates, if such techniques are desired. 

5.3 Sensitivity Analyses 

As mentioned in Section 4.1, we also test the sensitivity of these results by changing some 
of the parameters of the model. First, we increase the standard deviation of the distribution from 
which the true teacher effects are drawn. As expected from the discussion in Section 2, when 
teacher effects are “large” EB LAG performs similarly to DOLS, while the AR method continues 
to suffer in performance under the DG-PA and DG-NA scenarios. Second, we allow for more 
non-randomness (i.e., decrease the amount of noise) in the assignment of teachers into classrooms. 
As the assignment of teachers becomes more deterministic, the performance of AR and EB LAG 
suffers even more in terms of lower rank correlations and higher misclassification rates than what 
is observed in the results in Tables 1 and 2. Given that some models use multiple prior test scores 
(e.g., EVAAS, VARC), we also estimate DOLS, AR, and EB LAG with multiple lagged test scores 
as a sensitivity analysis. Although adding multiple lags improves the performance of AR and EB 
LAG in the random assignment case, these estimators are still outperformed 
by DOLS in the DG-PA and DG-NA scenarios. As a final sensitivity test we include a peer effect 
(e.g., the average $c_j$ of the student’s classmates) in the underlying DGP. Even when peer effects are included, 
EB LAG and AR continue to suffer in performance under the DG-PA and DG-NA cases. 

6 Comparing VAM Methods Using Real Data 

We also apply these estimation methods to actual student-level test score data and examine the 
rank correlations between the estimated teacher effects of the various estimators for each school 
district. In addition to rank correlations, we also examine whether teachers are being classified in 
the extremes uniformly across all of the estimators we examine. Although the real data does not 
allow comparison between the estimated effects and the true teacher effects, we are able to make 
comparisons between the estimated effects of the different estimators. This comparison provides a 
measure of the sensitivity of the estimated teacher effects to specifications that shrink the estimates 
and/or treat the teacher effects as random or fixed. The results of this analysis provide some 
perspective on the impact of shrinking and Empirical Bayes’ methods in a real-world setting. 

6.1 Data 

We apply the five methods described in Section 3 to data from an anonymous southern U.S. 
state. While state teacher evaluation systems often compute value-added for all teachers in the 
state, it is not uncommon for districts to conduct their own value-added analyses for high stakes 
decision-making. Statewide computation of value-added applies a one-size-fits-all approach, 
choosing one estimator for teachers in all districts while largely ignoring the differences in assignment 
mechanisms across districts. Thus, within-district value-added calculations may better rank 
teachers than statewide systems. To compare with our simulated analysis we estimate teacher effects 
district-by-district using equation (30), with math test scores as the dependent variable and controls 
for various student characteristics and dummies for the year. Student characteristics include race, 
gender, disability status, free/reduced price lunch eligibility, limited English proficiency status, 
and the number of student absences from school. Given that the simulations do not include student 
characteristics, we also conduct a sensitivity analysis that omits these variables. 

The data span 2001 through 2007 and grades four through six, but test scores from the annual 
assessment exam administered by the state are collected for each student from grades three through 
six. The data set includes 1,488,253 total students from which we have at least one current year 
score and one lagged score. Only 482,031 students have test scores for all grades. For simplicity 
and comparison with the simulation results, we estimate the value-added measures for the 20,749 
unique teachers with fifth grade students in the 67 districts, but again teachers receive multiple 
cohorts of students. While the average number of cohorts per teacher across the 67 districts is 3.88, 
we do observe 39 percent of teachers for only one year and an additional 20 percent of teachers 
for four or more years. On average, teachers have about 25 students per year, with only a small 
percentage (less than two percent) teaching more than 30 students per year. The high percentage 
of teachers that we observe for only one year could motivate researchers to employ shrinkage 
and EB estimators as a way to reduce precision problems due to minimal data. While there are 
seven very large districts with over 800 fifth grade teachers, the average number of fifth grade 
teachers in the other sixty districts is 172. In addition, 18 of the 67 districts have fewer than 36 total 
fifth grade teachers (the number we use in the simulation) suggesting that the simulation results 
are comparable to many of the smaller districts in the state. It is key to note that our sensitivity 
analysis that increased the number of simulated teachers to 72 yielded similar results, suggesting 
that our simulation is also representative of larger districts. 
6.2 Results 


Figure 1 presents box plots that depict the distributions of the within-district rank correlations 
between the various lagged score estimators, DOLS, SDOLS, AR, SAR, and EB LAG. The results 
presented here are for math scores, but the results are similar when reading scores are used. The 
results presented here also include student characteristics. Although there is no change in the over- 
all conclusions if these are omitted, the distributions between all of the estimators are slightly more 
dispersed. As in the discussion of the simulation results, we first compare the DOLS estimator, 
which treats the teacher effects as fixed, with the estimators that treat the teacher effects as ran- 
dom. Comparing DOLS and AR, we find that the median rank correlation is around 0.99, but there 
are nine districts with rank correlations below 0.90 and two districts with correlations below 0.50. 
We also observe a slightly lower median rank correlation between DOLS and EB LAG, at around 
0.97, with five districts with rank correlations below 0.90 and three below 0.50. These results are 
not inconsistent with our simulation results: the performance of DOLS, AR, and EB LAG is very 
similar under cases of random assignment of teachers to classrooms, but the performance of AR 
and EB LAG is substantially different from DOLS under non-random assignment based on prior 
test scores. Thus, it could be the case that these outlier districts observed in the left tails of the top 
two box plots may be composed of schools that engage more heavily in nonrandom assignment of 
teachers to classrooms. 

Comparing the two estimators that do not explicitly control for teacher assignment, AR 
and EB LAG, we find that while the median rank correlation is 0.96, nine districts have rank 
correlations between 0.82 and 0.92. These results suggest that the estimates are somewhat 
sensitive to how the teacher effects are calculated in the first stage. This was also the case in the 
simulation results, where the performance of the AR estimator suffered more than the performance 
of the EB LAG estimator in cases of nonrandom assignment based on the prior test score. 

For a thorough comparison with the simulation results, we also compare the shrunken and 
unshrunken estimates of DOLS and AR using the real data. We find median rank correlations 
of around 0.97 for both the DOLS and SDOLS comparison and the AR and SAR comparison, 
suggesting that shrinkage has a small impact on the estimates. It appears that in certain cases, 
shrinkage may have a larger impact on the DOLS estimates, as two districts have rank correlations 
of 0.50 and 0.72. Our simulation results suggested that shrinking the estimates had very little 
impact on estimator performance. 

In addition to rank correlation comparisons, we also examine the extent to which teachers 
are classified in the tails of the distribution by the different estimators. If shrinkage is having 
some effect, we would expect to see some teachers classified in the extremes to be pushed toward 
the middle of the distribution after applying the shrinkage factor. Table 3 lists the fraction of 
teachers ranked in the same quintile, either the top or bottom, by different pairs of estimators. 
Comparing estimators that assume fixed teacher effects with those that assume random teacher 
effects, we do not see much movement across quintiles. For example, comparing DOLS to EB 
LAG, we find that about 91 percent of the teachers that are classified in the top quintile using 
DOLS are also in this quintile using EB LAG. This suggests that teacher assignment may not 
be largely based on prior student achievement or that the prior test score is a poor proxy for the 
true assignment mechanism. If the prior test score or other covariates insufficiently proxy for the 
underlying assignment mechanism, then the choice to include teacher assignment variables will 
matter little in how teachers are ranked. 
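
A quintile-agreement fraction of the kind reported in Table 3 can be sketched as below. The teacher identifiers and estimates are hypothetical, and the function names are introduced only for illustration; the sketch assumes a quintile is simply the top or bottom fifth of teachers ranked by the estimate.

```python
def top_quintile(estimates):
    """Set of teacher ids in the top fifth of the estimates."""
    k = max(1, len(estimates) // 5)
    ranked = sorted(estimates, key=estimates.get, reverse=True)
    return set(ranked[:k])


def bottom_quintile(estimates):
    """Set of teacher ids in the bottom fifth of the estimates."""
    k = max(1, len(estimates) // 5)
    ranked = sorted(estimates, key=estimates.get)
    return set(ranked[:k])


def agreement(quintile_fn, est_a, est_b):
    """Fraction of teachers placed in the quintile by est_a that est_b also places there."""
    in_a = quintile_fn(est_a)
    in_b = quintile_fn(est_b)
    return len(in_a & in_b) / len(in_a)


# Hypothetical estimates for ten teachers from two estimators.
dols = {f"t{i}": v for i, v in enumerate(
    [0.9, 0.7, 0.5, 0.3, 0.1, -0.1, -0.3, -0.5, -0.7, -0.9])}
eb_lag = {f"t{i}": v for i, v in enumerate(
    [0.6, 0.35, 0.4, 0.2, 0.1, -0.1, -0.2, -0.3, -0.5, -0.6])}

top_agreement = agreement(top_quintile, dols, eb_lag)
bottom_agreement = agreement(bottom_quintile, dols, eb_lag)
```

Here the two estimators agree on both bottom-quintile teachers but on only one of the two top-quintile teachers.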

Comparing the rankings of unshrunken and corresponding shrunken estimators, we see that 
about 90 percent of teachers are ranked in the same quintile by both the unshrunken estimators 
(DOLS and AR) and their shrunken counterparts (SDOLS and SAR). This suggests that shrinking 
the estimates results in some reclassification of teachers in the tails to quintiles in the middle of 
the distribution. Using real data, however, we are unable to tell whether this reclassification is 
appropriate. Our simulation analysis suggested that shrinking the estimates had little, if any, impact 
on misclassification rates. 
7 Conclusion 


Using simulation experiments where the true teacher effects are known, we have explored the 
properties of two commonly used Empirical Bayes’ estimators as well as the effects of shrinking 
estimates of teacher effects in general. Overall, EB methods do not appear to have much advan- 
tage, if any, over simple methods such as DOLS that treat the teacher effects as fixed, even in the 
case of random teacher assignment where EB estimation is theoretically justified. Under random 
assignment, all estimators perform well in terms of ranking teachers, properly classifying teachers, 
and providing unbiased estimates. EB methods have a very slight gain in precision compared to 
the other methods in this case. 

We generally find that EB estimation is not appropriate under nonrandom teacher assignment. 
The hallmark of EB estimation of teacher effects is to treat the teacher effects as random variables 
that are independent of (or at least uncorrelated with) any other covariates. This assumption 
is tantamount to assuming that teacher assignment does not depend on other covariates such as 
past test scores (this is also true for the AR methods). When teacher assignment is not random, 
estimators that either explicitly control for the assignment mechanism or proxy for it in some way 
typically provide more reliable estimates of the teacher effects. Among the estimators and assign- 
ment scenarios we study, DOLS and SDOLS are the only estimators that control for the assignment 
mechanism (again, either explicitly or by proxy) through the inclusion of both the lagged test score 
and teacher assignment dummies. As expected, DOLS and SDOLS outperform the other estima- 
tors in the nonrandom teacher assignment scenarios. In the analysis of the real data, we found that 
the rank correlations between, say, DOLS and EB LAG or DOLS and SAR are quite low for some 
districts, suggesting that the decision between these estimators is important. Thus, if there is a 
possibility of nonrandom assignment, DOLS should be the preferred estimator. 

As predicted by theory and seen in the simulation results, the random effects estimator, EB 
LAG, converges to the fixed effects estimator, DOLS, as the number of students per teacher gets 
large. Therefore, it could be that EB LAG is performing well in large samples simply because 
the estimates are approaching the DOLS estimates. However, the average residual methods, AR 
and SAR, do not have this property. Thus, despite its recent popularity, we strongly caution against using 
SAR as a simpler way to operationalize the EB LAG estimator. If EB-type methods are being used, 
possibly as a way to control for classroom-level covariates and peer effects with minimal data (a 
case that we do not consider in this paper), it is important to estimate the coefficients in the first 
stage using random effects estimation, as in our EB LAG estimator, rather than OLS. 
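
One way to see the shrinkage part of this convergence is through the shrinkage factor. The expression below is the usual textbook form and is an assumption here, since the paper's own expression is its equation (25); the symbols $\sigma_b^2$ (variance of teacher effects), $\sigma_e^2$ (variance of the student-level error), and $n_g$ (students taught by teacher $g$) are notation introduced for this illustration.

```latex
\eta_g = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_e^2 / n_g}
\;\longrightarrow\; 1
\quad \text{as } n_g \to \infty
```

Under this form, a teacher with many students receives a weight close to one, so the shrunken estimate is close to the unshrunken one.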

Lastly, we find that shrinking the estimates of the teacher effects does not seem to improve the 
performance of the estimators, even in the case where estimates are based on one cohort of students. 
The performance measures are extremely close in our simulations for those estimators that differ 
only due to the shrinkage factor: DOLS and SDOLS, or AR and SAR. The rank correlations 
for these two pairs of estimators are also very close to one in almost all districts. Also, we find 
in the simulations that shrinking the AR estimates, which is a popular way to operationalize the 
EB approach, does not reduce misclassification of teachers. Thus, our evidence suggests that the 
rationale for using shrinkage estimators to reduce the misclassification of teachers due to noisy 
estimates of teacher effects should not be given much weight. Accounting for nonrandom teacher 
assignment when choosing among estimators is more important. 

Given the robust nature of the DOLS estimator to a wide variety of grouping and assignment 
scenarios, it should be preferred to AR and EB methods when there is uncertainty about the true 
underlying assignment mechanism. If the assignment mechanism is known to be random, applying 
these AR and EB estimators can be appropriate, especially when the amount of data per teacher 
is minimal. However, given that the assignment mechanism is not likely known, blindly applying 
these AR and EB methods can be extremely problematic, especially if teachers are truly assigned 
nonrandomly to classrooms. Therefore, we stress caution in applying these AR and EB methods 
and urge researchers and practitioners to be mindful of the underlying assignment mechanism 
when choosing between the various value-added methods. 
Notes 


1. Lockwood and McCaffrey (2007) have highlighted equation (27) in the context of student- 
level panel data, essentially appealing to the first edition of Wooldridge (2010). In the panel data 
setting (27) is arguably less relevant, as one rarely has more than a handful of time periods per stu- 
dent. For additional discussion of the relationship between random and fixed effects estimators, see 
Raudenbush (2009). In addition, Reardon and Raudenbush (2009) lay out the various assumptions 
underlying value-added estimation. 

2. Without covariates, the difference between the EB and fixed effects estimates of the $b_g$ is 
much less important: they differ only due to the shrinkage factor. In practice, the fixed effects 
estimates, $\hat{\beta}_g$, are obtained without removing an overall teacher average, which means $\hat{\beta}_g = \bar{y}_g$. 
To obtain a comparable expression for $b_g^*$ we must account for the GLS estimator of the mean 
teacher effect, which would be obtained as the intercept in the RE estimation. Call this estimator 
$\mu_b^*$, which in the case of no covariates is $\bar{y}^*$. Then the teacher effects are 

$$b_g^* = \mu_b^* + \eta_g(\bar{y}_g - \mu_b^*) = \eta_g \bar{y}_g + (1 - \eta_g)\mu_b^* = \bar{y}_g - (1 - \eta_g)(\bar{y}_g - \mu_b^*),$$

where $\eta_g$ is the shrinkage factor in equation (25). Compared with the FE estimate of $b_g$, $b_g^*$ is 
shrunk toward the overall mean $\mu_b^*$. When the teacher effects are treated as parameters to estimate, 
the $b_g^*$ are biased because of the shrinkage factor, even when they are BLUP. 
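
The shrinkage expression in this note can be illustrated numerically. The sketch below is not the paper's implementation: the form of the shrinkage factor in `shrinkage_factor` is the usual textbook one and is assumed here (the paper's own expression is equation (25)), and the variances, class means, and class sizes are hypothetical. It shows that shrinking leaves the ordering of two teachers unchanged when they have the same number of students, and can reverse it when they do not.

```python
import math


def shrinkage_factor(n_students, var_teacher, var_error):
    # Assumed form: teacher-effect variance over the total variance of a class mean.
    return var_teacher / (var_teacher + var_error / n_students)


def shrink(class_mean, overall_mean, eta):
    # b* = mu + eta * (ybar - mu), the first form of the expression in the note.
    return overall_mean + eta * (class_mean - overall_mean)


VAR_TEACHER, VAR_ERROR, OVERALL_MEAN = 1.0, 5.0, 0.0  # hypothetical values

# Teacher A: higher class mean but only 5 students; teacher B: 20 students.
eta_a = shrinkage_factor(5, VAR_TEACHER, VAR_ERROR)    # 0.5
eta_b = shrinkage_factor(20, VAR_TEACHER, VAR_ERROR)   # 0.8
shrunk_a = shrink(1.0, OVERALL_MEAN, eta_a)
shrunk_b = shrink(0.7, OVERALL_MEAN, eta_b)

# With equal class sizes the same factor applies to both, so the order is kept.
equal_a = shrink(1.0, OVERALL_MEAN, eta_b)
equal_b = shrink(0.7, OVERALL_MEAN, eta_b)

# The three algebraic forms in the note agree for any inputs.
ybar, mu, eta = 0.7, 0.2, 0.8
form1 = mu + eta * (ybar - mu)
form2 = eta * ybar + (1 - eta) * mu
form3 = ybar - (1 - eta) * (ybar - mu)
```

With these numbers teacher A ranks above teacher B before shrinkage (1.0 versus 0.7) and below after it (0.50 versus 0.56), because A's estimate rests on fewer students and is pulled further toward the mean.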

3. While we obtained the expression that underlies AR estimation of $\gamma$, $E(y \mid X) = X\gamma$, by 
treating the teacher effects as random and independent of $X$ and $Z$, the random effects structure is 
not used by the AR method. Thus, it is preferable to view the AR method as a regression-based 
approach that does not partial out teacher assignment when estimating $\gamma$. By contrast, the EB 
approach exploits the random effects structure of the teacher effects to obtain the BLUE of $\gamma$ and 
the BLUP of the teacher effects. 
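
The practical consequence of not partialling out teacher assignment can be seen in a small noise-free example. This is a toy sketch under stated assumptions, not the simulation design of the paper: there is a single covariate (the lagged score) with coefficient 0.5, three hypothetical teachers, and students with higher lagged scores are assigned to better teachers. DOLS is computed through within-teacher demeaning, which is numerically equivalent to OLS with a full set of teacher dummies.

```python
from statistics import mean

LAMBDA = 0.5  # coefficient on the lagged score (hypothetical value)
true_effects = {"t0": -1.0, "t1": 0.0, "t2": 1.0}
# Positive assignment: better teachers receive students with higher lagged scores.
lagged = {"t0": [-2.0, -1.0, 0.0], "t1": [-1.0, 0.0, 1.0], "t2": [0.0, 1.0, 2.0]}
# Noise-free current scores.
current = {t: [LAMBDA * l + true_effects[t] for l in lagged[t]] for t in lagged}


def average_residual(lagged, current):
    """AR: pooled OLS of score on lagged score, then mean residual by teacher."""
    all_lag = [l for t in lagged for l in lagged[t]]
    all_cur = [c for t in lagged for c in current[t]]
    m_lag, m_cur = mean(all_lag), mean(all_cur)
    slope = (sum((l - m_lag) * (c - m_cur) for l, c in zip(all_lag, all_cur))
             / sum((l - m_lag) ** 2 for l in all_lag))
    intercept = m_cur - slope * m_lag
    return {t: mean([c - intercept - slope * l
                     for l, c in zip(lagged[t], current[t])])
            for t in lagged}


def dynamic_ols(lagged, current):
    """DOLS: OLS with teacher dummies, via within-teacher demeaning."""
    dev_lag, dev_cur = [], []
    for t in lagged:
        m_lag, m_cur = mean(lagged[t]), mean(current[t])
        dev_lag += [l - m_lag for l in lagged[t]]
        dev_cur += [c - m_cur for c in current[t]]
    slope = (sum(l * c for l, c in zip(dev_lag, dev_cur))
             / sum(l * l for l in dev_lag))
    # Teacher effect = class mean score net of the lagged-score contribution.
    return {t: mean(current[t]) - slope * mean(lagged[t]) for t in lagged}


ar_effects = average_residual(lagged, current)
dols_effects = dynamic_ols(lagged, current)
```

In this example DOLS recovers the true effects (-1, 0, 1) exactly, while the pooled first stage of AR attributes part of the teacher effect to the lagged score, so the AR effects are compressed to (-0.5, 0, 0.5). The ranking survives here only because there is no noise; the compression goes in the same direction as the lower average theta reported for AR under DG-PA in Table 1.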
4. Despite only estimating value-added for grade 5 teachers, we keep the three-grade structure 
when generating the student test scores, since fifth-grade achievement is based on more than 
just the current teacher and prior test score of the student; it is a function of all prior teacher, 
unobservable student, and random influences. Thus, ignoring that process and generating fifth-grade 
test scores based on a “baseline” fourth-grade test score seems inappropriate in this context. 
Table 1: Simulation Results: Comparing Fixed and Random Teacher Effects Estimators 

λ = 0.5 

                                  Four Cohorts              One Cohort
G-A Mechanism  Evaluation Type    DOLS    AR      EB LAG    DOLS    AR      EB LAG

RA             Rank Correlation   0.85    0.85    0.86      0.65    0.65    0.67
               Misclassification  0.15    0.15    0.15      0.25    0.25    0.26
               Avg. Theta         1.01    1.01    0.78      1.03    1.03    0.49
               Avg. Std. Dev.     0.13    0.14    0.12      0.28    0.27    0.18
               MSE                0.018   0.019   0.013     0.076   0.076   0.034

DG-RA          Rank Correlation   0.85    0.85    0.86      0.64    0.64    0.65
               Misclassification  0.15    0.16    0.16      0.26    0.25    0.25
               Avg. Theta         1.01    0.99    0.77      1.00    0.98    0.45
               Avg. Std. Dev.     0.14    0.14    0.12      0.27    0.27    0.19
               MSE                0.019   0.020   0.014     0.075   0.071   0.034

DG-PA          Rank Correlation   0.86    0.60    0.76      0.63    0.38    0.45
               Misclassification  0.15    0.28    0.22      0.26    0.35    0.48
               Avg. Theta         0.99    0.53    0.49      0.98    0.52    0.16
               Avg. Std. Dev.     0.13    0.19    0.16      0.27    0.30    0.22
               MSE                0.018   0.037   0.024     0.074   0.091   0.051

DG-NA          Rank Correlation   0.85    0.62    0.78      0.67    0.41    0.48
               Misclassification  0.14    0.26    0.20      0.25    0.34    0.47
               Avg. Theta         1.01    0.54    0.53      1.03    0.54    0.17
               Avg. Std. Dev.     0.14    0.19    0.15      0.27    0.29    0.22
               MSE                0.019   0.035   0.022     0.074   0.086   0.050

HG-RA          Rank Correlation   0.72    0.73    0.73      0.58    0.59    0.60
               Misclassification  0.23    0.22    0.23      0.29    0.29    0.30
               Avg. Theta         1.02    1.02    0.86      1.00    0.99    0.54
               Avg. Std. Dev.     0.21    0.21    0.18      0.32    0.31    0.21
               MSE                0.046   0.045   0.033     0.100   0.097   0.044

HG-PA          Rank Correlation   0.94    0.93    0.94      0.81    0.79    0.81
               Misclassification  0.09    0.10    0.10      0.17    0.18    0.19
               Avg. Theta         1.61    1.52    1.43      1.60    1.51    1.06
               Avg. Std. Dev.     0.20    0.19    0.16      0.31    0.30    0.19
               MSE                0.042   0.038   0.027     0.097   0.092   0.035

HG-NA          Rank Correlation   0.39    0.38    0.41      0.26    0.25    0.28
               Misclassification  0.35    0.35    0.38      0.40    0.41    0.55
               Avg. Theta         0.33    0.32    0.15      0.34    0.33    0.06
               Avg. Std. Dev.     0.22    0.23    0.22      0.32    0.32    0.24
               MSE                0.050   0.052   0.047     0.101   0.102   0.058

Note: Rows of each scenario represent the following: 
First - Rank corr. of estimated effects and true effects 
Second - Fraction of above average teachers misclassified as below average 
Third - Average value of θ 
Fourth - Average standard deviation of estimated teacher effects across 100 reps 
Fifth - MSE measure 

Table 2: Simulation Results: Comparing Shrunken and Unshrunken Estimators 

λ = 0.5 

                                  Four Cohorts                      One Cohort
G-A Mechanism  Evaluation Type    DOLS    SDOLS   AR      SAR       DOLS    SDOLS   AR      SAR

RA             Rank Correlation   0.85    0.85    0.85    0.85      0.65    0.66    0.65    0.66
               Misclassification  0.15    0.15    0.15    0.15      0.25    0.25    0.25    0.25
               Avg. Theta         1.01    1.01    1.01    1.01      1.03    0.99    1.03    0.99
               Avg. Std. Dev.     0.13    0.13    0.14    0.14      0.28    0.26    0.27    0.26
               MSE                0.018   0.018   0.019   0.019     0.076   0.068   0.076   0.068

DG-RA          Rank Correlation   0.85    0.85    0.85    0.85      0.64    0.64    0.64    0.64
               Misclassification  0.15    0.15    0.16    0.16      0.26    0.25    0.25    0.25
               Avg. Theta         1.01    1.01    0.99    0.99      1.00    0.96    0.98    0.94
               Avg. Std. Dev.     0.14    0.14    0.14    0.14      0.27    0.26    0.27    0.25
               MSE                0.019   0.019   0.020   0.020     0.075   0.067   0.071   0.064

DG-PA          Rank Correlation   0.86    0.86    0.60    0.60      0.63    0.63    0.38    0.38
               Misclassification  0.15    0.15    0.28    0.28      0.26    0.27    0.35    0.36
               Avg. Theta         0.99    0.99    0.53    0.53      0.98    0.92    0.52    0.49
               Avg. Std. Dev.     0.13    0.13    0.19    0.19      0.27    0.25    0.30    0.29
               MSE                0.018   0.018   0.037   0.037     0.074   0.064   0.091   0.081

DG-NA          Rank Correlation   0.85    0.85    0.62    0.62      0.67    0.67    0.41    0.41
               Misclassification  0.14    0.14    0.26    0.26      0.25    0.25    0.34    0.34
               Avg. Theta         1.01    1.01    0.54    0.53      1.03    0.97    0.54    0.51
               Avg. Std. Dev.     0.14    0.14    0.19    0.19      0.27    0.25    0.29    0.28
               MSE                0.019   0.019   0.035   0.035     0.074   0.063   0.086   0.077

HG-RA          Rank Correlation   0.72    0.72    0.73    0.73      0.58    0.59    0.59    0.59
               Misclassification  0.23    0.23    0.22    0.22      0.29    0.29    0.29    0.29
               Avg. Theta         1.02    1.02    1.02    1.02      1.00    0.96    0.99    0.96
               Avg. Std. Dev.     0.21    0.21    0.21    0.21      0.32    0.30    0.31    0.30
               MSE                0.046   0.046   0.045   0.045     0.100   0.091   0.097   0.088

HG-PA          Rank Correlation   0.94    0.94    0.93    0.93      0.81    0.81    0.79    0.79
               Misclassification  0.09    0.09    0.10    0.10      0.17    0.17    0.18    0.18
               Avg. Theta         1.61    1.61    1.52    1.52      1.60    1.56    1.51    1.46
               Avg. Std. Dev.     0.20    0.20    0.19    0.19      0.31    0.30    0.30    0.29
               MSE                0.042   0.042   0.038   0.038     0.097   0.089   0.092   0.084

HG-NA          Rank Correlation   0.39    0.40    0.38    0.38      0.26    0.27    0.25    0.26
               Misclassification  0.35    0.35    0.35    0.35      0.40    0.41    0.41    0.41
               Avg. Theta         0.33    0.33    0.32    0.32      0.34    0.32    0.33    0.31
               Avg. Std. Dev.     0.22    0.22    0.23    0.23      0.32    0.30    0.32    0.30
               MSE                0.050   0.049   0.052   0.051     0.101   0.091   0.102   0.091

Note: Rows of each scenario represent the following: 
First - Rank corr. of estimated effects and true effects 
Second - Fraction of above average teachers misclassified as below average 
Third - Average value of θ 
Fourth - Average standard deviation of estimated teacher effects across 100 reps 
Fifth - MSE measure 

Figure 1: Spearman Rank Correlations Across Different VAM Estimators 

Table 3: Fraction of Teachers Ranked in Same Quintile by Estimator Pairs 

Top Quintile 
          DOLS    SDOLS   AR      SAR
SDOLS     0.91
AR        0.94    0.89
SAR       0.89    0.94    0.91
EB LAG    0.87    0.95    0.86    0.93

Bottom Quintile 
          DOLS    SDOLS   AR      SAR
SDOLS     0.89
AR        0.96    0.88
SAR       0.88    0.95    0.89
EB LAG    0.87    0.98    0.86    0.96

A Appendix 


Table A.1: Description of Value-Added Estimators 

Estimator               Acronym  Description                                                      Teacher Effects

Empirical Bayes’        EB LAG   Two-step approach: Estimate teacher effects using MLE on         Random
                                 dynamic equation and then shrink estimates by shrinkage factor

Average Residual        AR       Estimate dynamic equation by OLS and compute residuals for       Random
                                 each student. Then compute the average of these residuals for
                                 each teacher to get estimated teacher effect

Shrunken Avg. Residual  SAR      Two-step approach: Compute average residual for each teacher     Random
                                 using residuals from OLS on dynamic equation. Then shrink
                                 average residual for each teacher by shrinkage factor

Dynamic OLS             DOLS     Estimate teacher effects using ordinary least squares on         Fixed
                                 dynamic equation

Shrunken DOLS           SDOLS    Two-step approach: Estimate teacher effects using dynamic        Fixed
                                 equation and then shrink estimates by shrinkage factor

Table A.2: Definitions of Grouping-Assignment Mechanisms 

Name of G-A Mechanism                         Acronym  Grouping students in classrooms          Assigning students to teachers

Random Assignment                             RA       Random                                   Random
Dynamic Grouping - Random Assignment          DG-RA    Dynamic (based on prior test scores)     Random
Dynamic Grouping - Positive Assignment        DG-PA    Dynamic (based on prior test scores)     Positive corr. between teacher effects and prior student scores
Dynamic Grouping - Negative Assignment        DG-NA    Dynamic (based on prior test scores)     Negative corr. between teacher effects and prior student scores
Heterogeneity Grouping - Random Assignment    HG-RA    Static (based on student heterogeneity)  Random
Heterogeneity Grouping - Positive Assignment  HG-PA    Static (based on student heterogeneity)  Positive corr. between teacher effects and student fixed effects
Heterogeneity Grouping - Negative Assignment  HG-NA    Static (based on student heterogeneity)  Negative corr. between teacher effects and student fixed effects

Table A.3: Description of Evaluation Measures of Value-Added Estimator Performance 

Evaluation Measure  Description

Rank Correlation    Rank correlation between estimated teacher effect and true teacher effect
Misclassification   Fraction of above average teachers that are misclassified as below average
Average Theta       Average value of θ
Avg. Std. Dev.      Average standard deviation of estimated teacher effects across the 100 simulation reps
MSE                 Average value of MSE = $(\hat{\beta}_j - \beta_j)^2$
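
Two of the evaluation measures in Table A.3 can be sketched for a single simulation repetition as follows. This is an illustrative reading of the table's descriptions rather than the paper's code: "above average" is taken relative to the mean of the true effects, "below average" relative to the mean of the estimates, and the teacher effects used are hypothetical.

```python
from statistics import mean


def misclassification_rate(estimates, truths):
    """Share of truly above-average teachers whose estimate is below the average estimate."""
    true_mean, est_mean = mean(truths), mean(estimates)
    above = [i for i, b in enumerate(truths) if b > true_mean]
    wrong = [i for i in above if estimates[i] < est_mean]
    return len(wrong) / len(above)


def mse(estimates, truths):
    """Average squared difference between estimated and true teacher effects."""
    return mean([(e - b) ** 2 for e, b in zip(estimates, truths)])


# Hypothetical true and estimated effects for five teachers in one repetition.
truths = [-0.5, -0.25, 0.125, 0.25, 0.375]
estimates = [-0.25, -0.5, -0.125, 0.5, 0.375]

rate = misclassification_rate(estimates, truths)
error = mse(estimates, truths)
```

In the tables these quantities would then be averaged over the simulation repetitions; here one of the three above-average teachers is estimated to be below average.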